Nibble Block Format

ABSTRACT

A matrix multiplication system and method are provided. The system includes a memory that stores one or more weight tensors, a processor and a matrix multiply accelerator (MMA). The processor converts each weight tensor into an encoded block set that is stored in the memory. Each encoded block set includes a number of encoded blocks, and each encoded block includes a data field and an index field. The MMA converts each encoded block set into a reconstructed weight tensor, and convolves each reconstructed weight tensor and an input data tensor to generate an output data matrix.

BACKGROUND

The present disclosure relates to computer systems. More particularly, the present disclosure relates to a matrix multiplication system and method.

Artificial neural networks (ANNs), such as deep neural networks (DNNs), convolutional neural networks (CNNs), etc., are a popular solution to a wide array of challenging classification, recognition and regression problems. However, many ANN models require a large number of matrix calculations involving a large number of weights and activations, which presents a significant challenge with respect to access, storage and performance, particularly for mobile and other power or storage-constrained devices. An ANN hardware accelerator accelerates these matrix calculations, such as, for example, convolution operations performed by CNNs.

Generally, matrices may be classified as either sparse or dense. Most elements of a sparse matrix have a value of zero, while most elements of a dense matrix have a non-zero value. For the simple matrix multiplication operation C=A·B, when matrix A or matrix B is sparse, most of the matrix calculations will include a value of zero for at least one of the operands. When both matrix A and matrix B are sparse, an even greater number of matrix calculations will include a value of zero for at least one of the operands. Since multiplication by an operand that has a value of zero will always result in a product that has a value of zero, applying standard matrix multiplication techniques to sparse matrices is very inefficient due to the large number of operands that have a value of zero. Additionally, sparse matrices are allocated memory storage space in excess of their actual requirements due to the large number of elements that have a value of zero.
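As a rough illustration of the inefficiency described above, the following Python sketch counts how many scalar multiplications in a naive C=A·B have at least one operand equal to zero. The function name and the example matrices are hypothetical and are chosen only for illustration.

```python
def count_zero_multiplications(a, b):
    """Return (total, wasted) scalar multiplications for a naive C = A.B.

    A multiplication is counted as wasted when either operand is zero,
    because its product is known to be zero in advance.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    total = 0
    wasted = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                total += 1
                if a[i][k] == 0 or b[k][j] == 0:
                    wasted += 1
    return total, wasted


# Hypothetical example: A is sparse, B is mostly dense.
A = [[0, 0, 3],
     [0, 0, 0],
     [1, 0, 0]]
B = [[1, 2, 0],
     [0, 1, 4],
     [5, 0, 6]]

total, wasted = count_zero_multiplications(A, B)
print(total, wasted)
```

In this small example, only 4 of the 27 multiplications contribute a non-zero product.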

ANN matrices may be sparse or dense, with values that range from a minimum value to a maximum value, such as, for example, 0 to 255 for a matrix storing 8-bit unsigned integers, −128 to 127 for a matrix storing signed 8-bit integers, etc. Many ANN matrices, such as CNN weight matrices, include a large number of zero values, a number of small magnitude values, and a small number of large magnitude values. In other words, CNN weight matrices often have very sparse data with some small weight values and an occasional, and important, large weight value. Similar to sparse matrices in general, applying standard matrix multiplication techniques to many ANN matrices is very inefficient, and these matrices are allocated memory storage space in excess of their actual requirements.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A depicts an ANN, in accordance with an embodiment of the present disclosure.

FIG. 1B depicts a CNN, in accordance with an embodiment of the present disclosure.

FIG. 2A depicts a convolutional layer calculation for a CNN, in accordance with an embodiment of the present disclosure.

FIG. 2B depicts a converted convolutional layer calculation for a CNN, in accordance with an embodiment of the present disclosure.

FIG. 3 depicts a data flow diagram for a MAC array, in accordance with an embodiment of the present disclosure.

FIGS. 4A, 4B and 4C depict weight tensors, in accordance with an embodiment of the present disclosure.

FIGS. 5A, 5B and 5C depict basic block sets, in accordance with an embodiment of the present disclosure.

FIGS. 6A, 6B and 6C depict basic block matrix sets, in accordance with an embodiment of the present disclosure.

FIG. 7A depicts a matrix element encoding process, according to an embodiment of the present disclosure.

FIG. 7B depicts a basic block encoding process, according to an embodiment of the present disclosure.

FIGS. 8A, 8B, 8C and 8D depict a basic block encoding process for a basic block matrix set, according to an embodiment of the present disclosure.

FIG. 9 depicts a data flow diagram for a portion of a training process for a CNN, according to embodiments of the present disclosure.

FIG. 10 depicts a block diagram of a system, in accordance with an embodiment of the present disclosure.

FIG. 11 depicts a block diagram of a matrix multiply accelerator (MMA), in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION

Embodiments of the present disclosure will now be described with reference to the drawing figures, in which like reference numerals refer to like parts throughout.

Deep neural network inference involves operating on large tensors. Loading and manipulating this data tends to dominate the power consumption of inference tasks, even when running on custom hardware, such as a neural processing unit (NPU), graphics processing unit (GPU), etc. The cost may be advantageously reduced for ANN weight tensors, for example, by quantizing the weights to reduce the number of bits per weight, and pruning the weights to remove small values at or close to zero.
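The two cost reductions mentioned above may be sketched in Python as follows. The pruning threshold, the symmetric linear mapping to signed 8-bit integers, and the function names are all assumptions made for illustration; the disclosure does not prescribe a particular quantization or pruning scheme here.

```python
def prune(weights, threshold):
    """Set weights whose magnitude is below the threshold to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def quantize_int8(weights):
    """Map floating-point weights to signed 8-bit integers.

    Uses a symmetric linear scale (an assumption for this sketch) so that
    the largest magnitude maps to 127. Returns (quantized, scale).
    """
    max_mag = max(abs(w) for w in weights)
    if max_mag == 0:
        return [0] * len(weights), 1.0
    scale = max_mag / 127
    return [round(w / scale) for w in weights], scale


# Hypothetical weights: many near zero, a few small, one large.
weights = [0.02, -0.5, 0.0, 1.27, -0.01, 0.3]
pruned = prune(weights, threshold=0.05)
quantized, scale = quantize_int8(pruned)
print(pruned)
print(quantized)
```

After pruning, half of the example weights are exactly zero, and each remaining weight fits in 8 bits.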

Embodiments of the present disclosure advantageously provide a matrix encoding process that reduces storage requirements and allows for flexibility in both quantization and pruning within a fixed block size format. ANN data, such as ANN weight tensors, tend to have a large number of zeros, a smaller number of non-zeros, and an even smaller number of large magnitude non-zero values. Embodiments of the present disclosure advantageously provide a block-based encoding process for ANN data, such as weight tensors, that is of fixed storage size, but allows trade-off in the tensor elements between zero values, small non-zero values and large non-zero values. Many embodiments of the present disclosure also provide a fixed computation size per block, which is advantageous in many situations.

In one embodiment, a system includes a memory, a processor coupled to the memory and a matrix multiply accelerator (MMA) coupled to the processor and the memory. The memory is configured to store one or more weight tensors, each weight tensor including a number of weights. The processor is configured, for each weight tensor, to generate, based on the weight tensor, a basic block matrix set including a number of basic block matrices, each basic block matrix including a number of weights; to generate, based on the basic block matrix set, an encoded block set, the encoded block set including a number of encoded blocks, each encoded block including a data field and an index field, the data field including a number of encoded weights, the index field including an index associated with each weight in the basic block matrix, the number of encoded weights being less than the number of weights in the basic block matrix, each encoded block having a same size; and to store the encoded block set in the memory. The MMA is configured to convert each encoded block set into a reconstructed weight tensor having a number of weights equal to the number of weights of the respective weight tensor, and convolve each reconstructed weight tensor and an input data tensor to generate an output data matrix.

An ANN models the relationships between input data or signals and output data or signals using a network of interconnected nodes that is trained through a learning process. The nodes are arranged into various layers, including, for example, an input layer, one or more hidden layers, and an output layer. The input layer receives input data, such as, for example, image data, and the output layer generates output data, such as, for example, a probability that the image data contains a known object. Each hidden layer provides at least a partial transformation of the input data to the output data. A DNN has multiple hidden layers in order to model complex, nonlinear relationships between input data and output data.

In a fully-connected, feedforward ANN, each node is connected to all of the nodes in the preceding layer, as well as to all of the nodes in the subsequent layer. For example, each input layer node is connected to each hidden layer node, each hidden layer node is connected to each input layer node and each output layer node, and each output layer node is connected to each hidden layer node. Additional hidden layers are similarly interconnected. Each connection has a weight value, and each node has an activation function, such as, for example, a linear function, a step function, a sigmoid function, a tanh function, a rectified linear unit (ReLU) function, etc., that determines the output of the node based on the weighted sum of the inputs to the node. The input data propagates from the input layer nodes, through respective connection weights to the hidden layer nodes, and then through respective connection weights to the output layer nodes.

More particularly, at each input node, input data is provided to the activation function for that node, and the output of the activation function is then provided as an input data value to each hidden layer node. At each hidden layer node, the input data value received from each input layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation value that is provided to the activation function for that node. The output of the activation function is then provided as an input data value to each output layer node. At each output layer node, the output data value received from each hidden layer node is multiplied by a respective connection weight, and the resulting products are summed or accumulated into an activation value that is provided to the activation function for that node. The output of the activation function is then provided as output data. Additional hidden layers may be similarly configured to process data.
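The propagation described above may be sketched in Python as follows. The sketch assumes an identity activation at the input nodes and a ReLU activation elsewhere; the weight values and the function names are hypothetical.

```python
def relu(x):
    """Rectified linear unit activation."""
    return max(0.0, x)


def dense_layer(inputs, weights, activation):
    """Compute the outputs of one fully-connected layer.

    weights[n][m] is the connection weight from input m to node n.
    Each node accumulates the weighted sum of its inputs and then
    applies the activation function.
    """
    outputs = []
    for node_weights in weights:
        accumulated = sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(activation(accumulated))
    return outputs


# Hypothetical network: 3 input nodes, 2 hidden nodes, 1 output node.
inputs = [1.0, 2.0, 3.0]
hidden_weights = [[0.5, -1.0, 0.25],
                  [1.0, 1.0, 1.0]]
output_weights = [[2.0, 0.5]]

hidden = dense_layer(inputs, hidden_weights, relu)
output = dense_layer(hidden, output_weights, relu)
print(hidden, output)
```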

FIG. 1A depicts ANN 10, in accordance with an embodiment of the present disclosure.

ANN 10 includes input layer 20, one or more hidden layers 30, 40, 50, etc., and output layer 60. Input layer 20 includes one or more input nodes 21, 22, 23, etc. Hidden layer 30 includes one or more hidden nodes 31, 32, 33, 34, 35, etc. Hidden layer 40 includes one or more hidden nodes 41, 42, 43, 44, 45, etc. Hidden layer 50 includes one or more hidden nodes 51, 52, 53, 54, 55, etc. Output layer 60 includes one or more output nodes 61, 62, etc. Generally, ANN 10 includes N hidden layers, input layer 20 includes “i” nodes, hidden layer 30 includes “j” nodes, hidden layer 40 includes “k” nodes, hidden layer 50 includes “m” nodes, and output layer 60 includes “o” nodes.

In one embodiment, N equals 3, i equals 3, j, k and m equal 5 and o equals 2 (depicted in FIG. 1A). Input node 21 is coupled to hidden nodes 31 to 35, input node 22 is coupled to hidden nodes 31 to 35, and input node 23 is coupled to hidden nodes 31 to 35. Hidden node 31 is coupled to hidden nodes 41 to 45, hidden node 32 is coupled to hidden nodes 41 to 45, hidden node 33 is coupled to hidden nodes 41 to 45, hidden node 34 is coupled to hidden nodes 41 to 45, and hidden node 35 is coupled to hidden nodes 41 to 45. Hidden node 41 is coupled to hidden nodes 51 to 55, hidden node 42 is coupled to hidden nodes 51 to 55, hidden node 43 is coupled to hidden nodes 51 to 55, hidden node 44 is coupled to hidden nodes 51 to 55, and hidden node 45 is coupled to hidden nodes 51 to 55. Hidden node 51 is coupled to output nodes 61 and 62, hidden node 52 is coupled to output nodes 61 and 62, hidden node 53 is coupled to output nodes 61 and 62, hidden node 54 is coupled to output nodes 61 and 62, and hidden node 55 is coupled to output nodes 61 and 62.

Many other variations of input, hidden and output layers are clearly possible, including hidden layers that are locally-connected, rather than fully-connected, to one another.

Training an ANN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the ANN achieves a particular level of accuracy. One method is backpropagation, or backward propagation of errors, which iteratively and recursively determines a gradient descent with respect to the connection weights, and then adjusts the connection weights to improve the performance of the network.
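The weight adjustment step may be illustrated, in a greatly simplified form, by gradient descent on a single connection weight with a squared-error loss. This Python sketch does not propagate errors backward through multiple layers; the learning rate, step count and function name are assumptions made for illustration.

```python
def train_single_weight(x, y, w=0.0, learning_rate=0.1, steps=50):
    """Fit y = w * x by gradient descent on the loss (w * x - y) ** 2."""
    for _ in range(steps):
        error = w * x - y
        gradient = 2.0 * error * x  # derivative of the loss with respect to w
        w -= learning_rate * gradient
    return w


# Hypothetical training example: input 1.0 should produce output 3.0.
w = train_single_weight(x=1.0, y=3.0)
print(w)
```

Each step reduces the prediction error, so the weight approaches 3.0 for this example.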

A multi-layer perceptron (MLP) is a fully-connected ANN that has an input layer, an output layer and one or more hidden layers. MLPs may be used for natural language processing applications, such as machine translation, speech recognition, etc. Other ANNs include recurrent neural networks (RNNs), long short-term memories (LSTMs), sequence-to-sequence models that include an encoder RNN and a decoder RNN, shallow neural networks, etc.

A CNN is a variation of an MLP that may be used for classification or recognition applications, such as image recognition, speech recognition, etc. Typically, native convolution operations are not performed by a CNN due to the complicated dataflow and expensive datapaths that are usually required. Instead, native convolution operations are converted into generic matrix multiplication (GEMM) operations, and then the GEMM operations are executed more efficiently by a central processing unit (CPU), specialized processor, hardware accelerator processing engine, etc., using optimized software libraries or specialized hardware. More particularly, an “IM2COL” software function may be used to convert the filter (weight) matrix and the input feature map (IFM) matrix for each convolution operation into an expanded format that is compatible with a GEMM operation. The IM2COL versions of each filter (weight) matrix and each IFM matrix are generated and stored in memory, and then loaded from memory and processed by the GEMM operation.
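The conversion of a convolution into a matrix multiplication may be sketched in Python as follows, for a single channel, a stride of one and no padding. The row-per-patch layout, the function names and the example values are assumptions; practical IM2COL implementations may use a different layout.

```python
def im2col(ifm, k):
    """Expand an input feature map into one flattened k x k patch per row."""
    height, width = len(ifm), len(ifm[0])
    patches = []
    for r in range(height - k + 1):
        for c in range(width - k + 1):
            patches.append([ifm[r + i][c + j]
                            for i in range(k) for j in range(k)])
    return patches


def gemm_vector(flat_filter, patches):
    """Multiply the patch matrix by the flattened filter (one dot product per patch)."""
    return [sum(f * p for f, p in zip(flat_filter, patch)) for patch in patches]


# Hypothetical 3x3 input feature map and 2x2 filter.
ifm = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
filt = [[1, 0],
        [0, -1]]

flat_filter = [v for row in filt for v in row]
out = gemm_vector(flat_filter, im2col(ifm, 2))
print(out)
```

The expanded patch matrix repeats input elements that are shared by overlapping patches, which is why the IM2COL format requires more memory than the original input feature map.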

A CNN has an input layer, an output layer and multiple hidden layers including convolutional layers, pooling layers, normalization layers, fully-connected layers, etc. Each convolutional layer applies a sliding dot product or cross-correlation to an input volume, applies an activation function to the results, and then provides the activation or output volume to the next layer. Convolutional layers typically use the ReLU function as the activation function. In certain embodiments, the activation function is provided in a separate activation layer, such as, for example, a ReLU layer. A pooling layer reduces the dimensions of the output volume received from the preceding convolutional layer, and may calculate an average or a maximum over small clusters of data, such as, for example, 2×2 matrices. In certain embodiments, a convolutional layer and a pooling layer may form a single layer of a CNN. The fully-connected layers follow the convolutional and pooling layers, and include a flatten layer and a classification layer, followed by a normalization layer that includes a normalization function, such as the SoftMax function. The output layer follows the last fully-connected layer; in certain embodiments, the output layer may include the normalization function.
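The ReLU activation and a pooling operation over 2×2 clusters may be sketched in Python as follows. The sketch assumes maximum pooling with non-overlapping clusters; the function names and the example matrix are hypothetical.

```python
def relu_map(matrix):
    """Apply the ReLU function to every element of a matrix."""
    return [[max(0, v) for v in row] for row in matrix]


def max_pool_2x2(matrix):
    """Take the maximum over each non-overlapping 2x2 cluster."""
    pooled = []
    for r in range(0, len(matrix) - 1, 2):
        row = []
        for c in range(0, len(matrix[0]) - 1, 2):
            row.append(max(matrix[r][c], matrix[r][c + 1],
                           matrix[r + 1][c], matrix[r + 1][c + 1]))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 convolution result.
conv_result = [[1, -2, 3, 0],
               [-1, 5, -6, 2],
               [0, 0, 7, -8],
               [-3, 4, 1, 1]]

activated = relu_map(conv_result)
pooled = max_pool_2x2(activated)
print(pooled)
```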

FIG. 1B depicts CNN 15, in accordance with an embodiment of the present disclosure.

CNN 15 includes input layer 20, one or more hidden layers, such as convolutional layer 30-1, pooling layer 30-2, hidden (flatten) layer 40, hidden (classification) layer 50, etc., and output layer 60. Many other variations of input, hidden and output layers are contemplated.

Input layer 20 includes one or more input nodes 21, etc., that present the input data, such as a color image, as an input volume to the first convolutional layer, e.g., convolutional layer 30-1. The input volume is a three-dimensional matrix that has a width, a height and a depth. For example, input data that represent a color image are presented as an input volume that is 512 pixels×512 pixels×3 channels (red, green, blue); other input volume dimensions may also be used, such as 32×32×3, 64×64×3, 128×128×3, etc., 32×32×1, 64×64×1, 128×128×1, 512×512×1, etc.

Convolutional layer 30-1 is locally-connected to input layer 20, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). For a CNN that uses a standard convolution, each node computes a dot product between the node's weights and the respective local region of the input volume. An activation function is then applied to the results of each convolution calculation to produce an output volume that is provided as an input volume to the subsequent layer. The activation function may be applied by each convolutional layer node or by the nodes of a subsequent locally-connected ReLU layer.

Pooling layer 30-2 is locally-connected to convolutional layer 30-1, and includes a plurality of nodes that are connected to local regions in the input volume (not depicted for clarity). Pooling layer 30-2 also produces an output volume that is provided as the input volume to the subsequent layer, such as, for example, another convolutional layer 30-1, a flatten layer 40, etc. In certain embodiments, convolutional layer 30-1 and pooling layer 30-2 form a single hidden layer 30. Similarly, in certain embodiments, convolutional layer 30-1, a ReLU layer and pooling layer 30-2 form a single hidden layer 30. Generally, the output volumes of the convolutional and pooling layers may be described as feature maps, and one or more single hidden layers 30 form a feature learning portion of CNN 15.

Hidden layer 40 is a “flatten” layer that is locally-connected to pooling layer 30-2, and includes one or more hidden (flatten) nodes 41, 42, 43, 44, 45, etc. Hidden (flatten) layer 40 “flattens” the output volume produced by the preceding pooling layer 30-2 into a column vector, which is provided to the subsequent, fully-connected hidden layer 50.

Hidden layer 50 is a classification layer that is fully-connected to hidden (flatten) layer 40, and includes one or more hidden (classification) nodes 51, 52, 53, 54, 55, etc.

Output layer 60 includes one or more output nodes 61, 62, etc., and is fully-connected to hidden (classification) layer 50. Fully-connected output layer 60 receives the classification results output by hidden (classification) layer 50, and each node outputs a predicted class score. A normalization function, such as a Softmax function, may be applied to the predicted class scores by output layer 60, or, alternatively, by an additional layer interposed between hidden (classification) layer 50 and output layer 60.
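A Softmax normalization of predicted class scores may be sketched in Python as follows. Subtracting the maximum score before exponentiation is a common numerical-stability measure and is an assumption of this sketch; the scores are hypothetical.

```python
import math


def softmax(scores):
    """Normalize class scores into probabilities that sum to one."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical predicted class scores from two output nodes.
probabilities = softmax([2.0, 1.0])
print(probabilities)
```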

Similar to ANNs, training a CNN includes optimizing the connection weights between nodes by minimizing the prediction error of the output data until the CNN achieves a particular level of accuracy. As noted above, backpropagation may be used to iteratively and recursively determine a gradient descent with respect to the connection weights, and then adjust the connection weights to improve the performance of the network. Matrix multiplication operations, and, more particularly, multiply-and-accumulate (MAC) operations, are used extensively by CNNs, as well as other ANNs.

FIG. 2A depicts convolutional layer calculation 200 for a CNN, in accordance with an embodiment of the present disclosure.

Input feature maps form an input data tensor 204 that includes eight input channels and one input data matrix for each channel, i.e., input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸. Filter 202 includes three filter or weight tensors 202 ¹, 202 ² and 202 ³, and each filter or weight tensor includes eight weight matrices, one weight matrix for each channel, i.e., weight tensor 202 ¹ includes weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, weight tensor 202 ² includes weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and weight tensor 202 ³ includes weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈. Output feature maps form an output data tensor 206 that includes three output channels and one output data matrix for each filter or weight tensor, i.e., output data matrices 206 ¹, 206 ² and 206 ³. Convolutional layer calculation 200 convolves weight tensors 202 ¹, 202 ² and 202 ³ with input data tensor 204 to produce output data tensor 206.

Each tensor has a height, a width and a depth. The depth of the input data tensor is equal to the number of input channels, the depth of each weight tensor is equal to the number of input channels, and the depth of the output tensor is equal to the number of output channels, i.e., the number of weight tensors in the filter. While the particular dimensions for the tensors and constituent matrices have been selected for clarity of illustration and explanation, embodiments of the present disclosure are not so limited. For example, a convolutional layer calculation may include four input channels and one output channel, a filter with one weight tensor with four weight matrices, an input data tensor with four input data matrices, and an output data tensor with one output data matrix. In this example, the weight tensor is convolved with the input data tensor to generate the output data matrix.

In one embodiment, each input data matrix 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸ is a 6×6 matrix associated with a different input channel and includes 36 activations. For example, input data matrix 204 ¹ is associated with the first input channel and includes activations a¹ ₁, a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₆, a¹ ₇, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₂, a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₄, a¹ ₂₅, a¹ ₂₆, a¹ ₂₇, a¹ ₂₈, a¹ ₂₉, a¹ ₃₀, a¹ ₃₁, a¹ ₃₂, a¹ ₃₃, a¹ ₃₄, a¹ ₃₅ and a¹ ₃₆, input data matrix 204 ² is associated with the second input channel and includes activations a² ₁, a² ₂, a² ₃, a² ₄, a² ₅, a² ₆, a² ₇, a² ₈, a² ₉, a² ₁₀, a² ₁₁, a² ₁₂, a² ₁₃, a² ₁₄, a² ₁₅, a² ₁₆, a² ₁₇, a² ₁₈, a² ₁₉, a² ₂₀, a² ₂₁, a² ₂₂, a² ₂₃, a² ₂₄, a² ₂₅, a² ₂₆, a² ₂₇, a² ₂₈, a² ₂₉, a² ₃₀, a² ₃₁, a² ₃₂, a² ₃₃, a² ₃₄, a² ₃₅ and a² ₃₆, and so on.

In this embodiment, each weight tensor 202 ¹, 202 ² and 202 ³ includes eight 4×4 weight matrices, and each weight matrix is associated with a different input channel. Weight tensor 202 ¹ includes weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈. The first weight matrix 202 ¹ ₁ is a 4×4 matrix associated with the first channel, and includes weights w¹ ₁, w¹ ₂, w¹ ₃, w¹ ₄, w¹ ₅, w¹ ₆, w¹ ₇, w¹ ₈, w¹ ₉, w¹ ₁₀, w¹ ₁₁, w¹ ₁₂, w¹ ₁₃, w¹ ₁₄, w¹ ₁₅ and w¹ ₁₆, the second weight matrix 202 ¹ ₂ is a 4×4 matrix associated with the second channel, and includes weights w² ₁, w² ₂, w² ₃, w² ₄, w² ₅, w² ₆, w² ₇, w² ₈, w² ₉, w² ₁₀, w² ₁₁, w² ₁₂, w² ₁₃, w² ₁₄, w² ₁₅ and w² ₁₆, and so on.

Weight tensor 202 ² includes weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈. The first weight matrix 202 ² ₁ is a 4×4 matrix associated with the first channel, and includes weights x¹ ₁, x¹ ₂, x¹ ₃, x¹ ₄, x¹ ₅, x¹ ₆, x¹ ₇, x¹ ₈, x¹ ₉, x¹ ₁₀, x¹ ₁₁, x¹ ₁₂, x¹ ₁₃, x¹ ₁₄, x¹ ₁₅ and x¹ ₁₆, the second weight matrix 202 ² ₂ is a 4×4 matrix associated with the second channel, and includes weights x² ₁, x² ₂, x² ₃, x² ₄, x² ₅, x² ₆, x² ₇, x² ₈, x² ₉, x² ₁₀, x² ₁₁, x² ₁₂, x² ₁₃, x² ₁₄, x² ₁₅ and x² ₁₆, and so on.

Weight tensor 202 ³ includes weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈. The first weight matrix 202 ³ ₁ is a 4×4 matrix associated with the first channel, and includes weights y¹ ₁, y¹ ₂, y¹ ₃, y¹ ₄, y¹ ₅, y¹ ₆, y¹ ₇, y¹ ₈, y¹ ₉, y¹ ₁₀, y¹ ₁₁, y¹ ₁₂, y¹ ₁₃, y¹ ₁₄, y¹ ₁₅ and y¹ ₁₆, the second weight matrix 202 ³ ₂ is a 4×4 matrix associated with the second channel, and includes weights y² ₁, y² ₂, y² ₃, y² ₄, y² ₅, y² ₆, y² ₇, y² ₈, y² ₉, y² ₁₀, y² ₁₁, y² ₁₂, y² ₁₃, y² ₁₄, y² ₁₅ and y² ₁₆, and so on.

In this embodiment, each output data matrix 206 ¹, 206 ² and 206 ³ is a 3×3 matrix associated with a different output channel and includes 9 output elements. For example, output data matrix 206 ¹ is associated with the first output channel and includes outputs o¹ ₁, o¹ ₂, o¹ ₃, o¹ ₄, o¹ ₅, o¹ ₆, o¹ ₇, o¹ ₈ and o¹ ₉, output data matrix 206 ² is associated with the second output channel and includes outputs o² ₁, o² ₂, o² ₃, o² ₄, o² ₅, o² ₆, o² ₇, o² ₈ and o² ₉, and output data matrix 206 ³ is associated with the third output channel and includes outputs o³ ₁, o³ ₂, o³ ₃, o³ ₄, o³ ₅, o³ ₆, o³ ₇, o³ ₈ and o³ ₉.
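The dimensions of this embodiment are consistent with a convolution that uses a stride of one and no padding, which is an assumption here because neither parameter is stated explicitly. Under that assumption, the output dimensions may be checked with the following Python sketch, in which the function name is hypothetical.

```python
def conv_output_shape(in_height, in_width, k_height, k_width,
                      num_weight_tensors, stride=1):
    """Return (height, width, depth) of the output data tensor.

    Assumes no padding. The output depth equals the number of
    weight tensors in the filter.
    """
    out_height = (in_height - k_height) // stride + 1
    out_width = (in_width - k_width) // stride + 1
    return out_height, out_width, num_weight_tensors


# 6x6 input data matrices, 4x4 weight matrices, three weight tensors.
shape = conv_output_shape(6, 6, 4, 4, 3)
print(shape)
```

For 6×6 input data matrices and 4×4 weight matrices, each output data matrix is 3×3, and the three weight tensors yield three output channels.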

For ease of explanation, each input data matrix 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸ may be divided into three quadrants. The first quadrant spans the top (first) row, second row, third row and fourth row, the second quadrant spans the second row, third row, fourth row and fifth row, and the third quadrant spans the third row, fourth row, fifth row and sixth (bottom) row. The first quadrant for input data matrix 204 ¹ (a¹ _(q1)) and the first quadrant for input data matrix 204 ⁸ (a⁸ _(q1)) are labeled; the remaining quadrants for each input data matrix are not labeled for clarity. The elements of each quadrant for input data matrix 204 ¹ are described as follows, and the quadrants of input data matrices 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸ are similarly arranged.

First quadrant a¹ _(q1) includes activations a¹ ₁, a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₆, a¹ ₇, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₂, a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃ and a¹ ₂₄, from which three blocks of activations are formed, i.e., a first block (i.e., activations a¹ ₁, a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₇, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁ and a¹ ₂₂), a second block (i.e., activations a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂ and a¹ ₂₃), and a third block (i.e., activations a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₆, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₂, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃ and a¹ ₂₄).

Second quadrant a¹ _(q2) includes activations a¹ ₇, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₂, a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₄, a¹ ₂₅, a¹ ₂₆, a¹ ₂₇, a¹ ₂₈, a¹ ₂₉ and a¹ ₃₀, from which three blocks of activations are formed, i.e., a first block (i.e., activations a¹ ₇, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₅, a¹ ₂₆, a¹ ₂₇ and a¹ ₂₈), a second block (i.e., activations a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₆, a¹ ₂₇, a¹ ₂₈ and a¹ ₂₉), and a third block (i.e., activations a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₂, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₄, a¹ ₂₇, a¹ ₂₈, a¹ ₂₉ and a¹ ₃₀).

Third quadrant a¹ _(q3) includes activations a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₄, a¹ ₂₅, a¹ ₂₆, a¹ ₂₇, a¹ ₂₈, a¹ ₂₉, a¹ ₃₀, a¹ ₃₁, a¹ ₃₂, a¹ ₃₃, a¹ ₃₄, a¹ ₃₅ and a¹ ₃₆, from which three blocks of activations are formed, i.e., a first block (i.e., activations a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₅, a¹ ₂₆, a¹ ₂₇, a¹ ₂₈, a¹ ₃₁, a¹ ₃₂, a¹ ₃₃ and a¹ ₃₄), a second block (i.e., activations a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₆, a¹ ₂₇, a¹ ₂₈, a¹ ₂₉, a¹ ₃₂, a¹ ₃₃, a¹ ₃₄ and a¹ ₃₅), and a third block (i.e., activations a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₄, a¹ ₂₇, a¹ ₂₈, a¹ ₂₉, a¹ ₃₀, a¹ ₃₃, a¹ ₃₄, a¹ ₃₅ and a¹ ₃₆).
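The activation subscripts listed for each block follow a regular pattern: the quadrant selects the starting row, the block selects the starting column, and the activations of a 6×6 input data matrix are numbered 1 to 36 in row-major order. The following Python sketch reproduces the subscripts; the function name and its parameters are hypothetical.

```python
def block_indices(quadrant, block, matrix_width=6, block_size=4):
    """Return the 1-based activation subscripts of one block.

    quadrant and block are 1-based. Quadrant q starts at row q and
    block b starts at column b of the input data matrix.
    """
    first_row = quadrant - 1
    first_col = block - 1
    return [(first_row + r) * matrix_width + (first_col + c) + 1
            for r in range(block_size)
            for c in range(block_size)]


first_block_q1 = block_indices(quadrant=1, block=1)
third_block_q3 = block_indices(quadrant=3, block=3)
print(first_block_q1)
print(third_block_q3)
```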

Output data tensor 206 may also be divided into three quadrants; in this case, each quadrant spans all three output data matrices 206 ¹, 206 ² and 206 ³. The first quadrant spans the top (first) row of each output data matrix, the second quadrant spans the second row of each output data matrix, and the third quadrant spans the third (bottom) row of each output data matrix. The first quadrant for output data matrices 206 ¹, 206 ² and 206 ³ (i.e., o¹ _(q1), o² _(q1) and o³ _(q1)) is labeled; the remaining quadrants are not labeled for clarity.

For output data matrix 206 ¹, the first quadrant o¹ _(q1) includes o¹ ₁, o¹ ₂ and o¹ ₃, the second quadrant o¹ _(q2) includes o¹ ₄, o¹ ₅ and o¹ ₆, and the third quadrant o¹ _(q3) includes o¹ ₇, o¹ ₈ and o¹ ₉. For output data matrix 206 ², the first quadrant o² _(q1) includes o² ₁, o² ₂ and o² ₃, the second quadrant o² _(q2) includes o² ₄, o² ₅ and o² ₆, and the third quadrant o² _(q3) includes o² ₇, o² ₈ and o² ₉. For output data matrix 206 ³, the first quadrant o³ _(q1) includes o³ ₁, o³ ₂ and o³ ₃, the second quadrant o³ _(q2) includes o³ ₄, o³ ₅ and o³ ₆, and the third quadrant o³ _(q3) includes o³ ₇, o³ ₈ and o³ ₉.

Generally, each output element within output data matrices 206 ¹, 206 ² and 206 ³ is the sum of the dot products of one of the weight tensors 202 ¹, 202 ² and 202 ³ and a block of activation elements within a particular quadrant of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸.

The calculation of the output elements in first quadrants o¹ _(q1), o² _(q1) and o³ _(q1) follows.

Output element o¹ ₁ of output data matrix 206 ¹ is the sum of the dot products of weight tensor 202 ¹, i.e., weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, and the first block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Generally, each weight matrix 202 ¹ _(i) includes weights w^(i) ₁, w^(i) ₂, w^(i) ₃, w^(i) ₄, w^(i) ₅, w^(i) ₆, w^(i) ₇, w^(i) ₈, w^(i) ₉, w^(i) ₁₀, w^(i) ₁₁, w^(i) ₁₂, w^(i) ₁₃, w^(i) ₁₄, w^(i) ₁₅ and w^(i) ₁₆, and the first block of activation elements within the first quadrant a^(i) _(q1) of each input data matrix 204 ^(i) includes a^(i) ₁, a^(i) ₂, a^(i) ₃, a^(i) ₄, a^(i) ₇, a^(i) ₈, a^(i) ₉, a^(i) ₁₀, a^(i) ₁₃, a^(i) ₁₄, a^(i) ₁₅, a^(i) ₁₆, a^(i) ₁₉, a^(i) ₂₀, a^(i) ₂₁ and a^(i) ₂₂.

For example, weight matrix 202 ¹ ₁ includes weights w¹ ₁, w¹ ₂, w¹ ₃, w¹ ₄, w¹ ₅, w¹ ₆, w¹ ₇, w¹ ₈, w¹ ₉, w¹ ₁₀, w¹ ₁₁, w¹ ₁₂, w¹ ₁₃, w¹ ₁₄, w¹ ₁₅ and w¹ ₁₆, weight matrix 202 ¹ ₂ includes weights w² ₁, w² ₂, w² ₃, w² ₄, w² ₅, w² ₆, w² ₇, w² ₈, w² ₉, w² ₁₀, w² ₁₁, w² ₁₂, w² ₁₃, w² ₁₄, w² ₁₅ and w² ₁₆, and so on, while the first block of activation elements within the first quadrant a¹ _(q1) of input data matrix 204 ¹ includes a¹ ₁, a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₇, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁ and a¹ ₂₂, the first block of activation elements within the first quadrant a² _(q1) of input data matrix 204 ² includes a² ₁, a² ₂, a² ₃, a² ₄, a² ₇, a² ₈, a² ₉, a² ₁₀, a² ₁₃, a² ₁₄, a² ₁₅, a² ₁₆, a² ₁₉, a² ₂₀, a² ₂₁ and a² ₂₂, and so on.

More particularly, the following dot products are summed to generate output element o¹ ₁: the dot product of the first weight matrix 202 ¹ ₁ of weight tensor 202 ¹ and the first block of quadrant a¹ _(q1) of input data matrix 204 ¹ (i.e., w¹ ₁·a¹ ₁+w¹ ₂·a¹ ₂+w¹ ₃·a¹ ₃+w¹ ₄·a¹ ₄+w¹ ₅·a¹ ₇+w¹ ₆·a¹ ₈+w¹ ₇·a¹ ₉+w¹ ₈·a¹ ₁₀+w¹ ₉·a¹ ₁₃+w¹ ₁₀·a¹ ₁₄+w¹ ₁₁·a¹ ₁₅+w¹ ₁₂·a¹ ₁₆+w¹ ₁₃·a¹ ₁₉+w¹ ₁₄·a¹ ₂₀+w¹ ₁₅·a¹ ₂₁+w¹ ₁₆·a¹ ₂₂), the dot product of the second weight matrix 202 ¹ ₂ of weight tensor 202 ¹ and the first block of quadrant a² _(q1) of input data matrix 204 ² (i.e., w² ₁·a² ₁+w² ₂·a² ₂+w² ₃·a² ₃+w² ₄·a² ₄+w² ₅·a² ₇+w² ₆·a² ₈+w² ₇·a² ₉+w² ₈·a² ₁₀+w² ₉·a² ₁₃+w² ₁₀·a² ₁₄+w² ₁₁·a² ₁₅+w² ₁₂·a² ₁₆+w² ₁₃·a² ₁₉+w² ₁₄·a² ₂₀+w² ₁₅·a² ₂₁+w² ₁₆·a² ₂₂), the dot product of the third weight matrix 202 ¹ ₃ of weight tensor 202 ¹ and the first block of quadrant a³ _(q1) of input data matrix 204 ³ (i.e., w³ ₁·a³ ₁+w³ ₂·a³ ₂+w³ ₃·a³ ₃+w³ ₄·a³ ₄+w³ ₅·a³ ₇+w³ ₆·a³ ₈+w³ ₇·a³ ₉+w³ ₈·a³ ₁₀+w³ ₉·a³ ₁₃+w³ ₁₀·a³ ₁₄+w³ ₁₁·a³ ₁₅+w³ ₁₂·a³ ₁₆+w³ ₁₃·a³ ₁₉+w³ ₁₄·a³ ₂₀+w³ ₁₅·a³ ₂₁+w³ ₁₆·a³ ₂₂), the dot product of the fourth weight matrix 202 ¹ ₄ of weight tensor 202 ¹ and the first block of quadrant a⁴ _(q1) of input data matrix 204 ⁴ (i.e., w⁴ ₁·a⁴ ₁+w⁴ ₂·a⁴ ₂+w⁴ ₃·a⁴ ₃+w⁴ ₄·a⁴ ₄+w⁴ ₅·a⁴ ₇+w⁴ ₆·a⁴ ₈+w⁴ ₇·a⁴ ₉+w⁴ ₈·a⁴ ₁₀+w⁴ ₉·a⁴ ₁₃+w⁴ ₁₀·a⁴ ₁₄+w⁴ ₁₁·a⁴ ₁₅+w⁴ ₁₂·a⁴ ₁₆+w⁴ ₁₃·a⁴ ₁₉+w⁴ ₁₄·a⁴ ₂₀+w⁴ ₁₅·a⁴ ₂₁+w⁴ ₁₆·a⁴ ₂₂), the dot product of the fifth weight matrix 202 ¹ ₅ of weight tensor 202 ¹ and the first block of quadrant a⁵ _(q1) of input data matrix 204 ⁵ (i.e., w⁵ ₁·a⁵ ₁+w⁵ ₂·a⁵ ₂+w⁵ ₃·a⁵ ₃+w⁵ ₄·a⁵ ₄+w⁵ ₅·a⁵ ₇+w⁵ ₆·a⁵ ₈+w⁵ ₇·a⁵ ₉+w⁵ ₈·a⁵ ₁₀+w⁵ ₉·a⁵ ₁₃+w⁵ ₁₀·a⁵ ₁₄+w⁵ ₁₁·a⁵ ₁₅+w⁵ ₁₂·a⁵ ₁₆+w⁵ ₁₃·a⁵ ₁₉+w⁵ ₁₄·a⁵ ₂₀+w⁵ ₁₅·a⁵ ₂₁+w⁵ ₁₆·a⁵ ₂₂), the dot product of the sixth weight matrix 202 ¹ ₆ of weight tensor 202 ¹ and the first block of quadrant a⁶ _(q1) of input data matrix 204 ⁶ (i.e., w⁶ ₁·a⁶ ₁+w⁶ ₂·a⁶ ₂+w⁶ ₃·a⁶ ₃+w⁶ ₄·a⁶ ₄+w⁶ ₅·a⁶ ₇+w⁶ ₆·a⁶ ₈+w⁶ ₇·a⁶ ₉+w⁶ ₈·a⁶ ₁₀+w⁶ ₉·a⁶ ₁₃+w⁶ ₁₀·a⁶ ₁₄+w⁶ ₁₁·a⁶ ₁₅+w⁶ ₁₂·a⁶ ₁₆+w⁶ ₁₃·a⁶ ₁₉+w⁶ ₁₄·a⁶ ₂₀+w⁶ ₁₅·a⁶ ₂₁+w⁶ ₁₆·a⁶ ₂₂), the dot product of the seventh weight matrix 202 ¹ ₇ of weight tensor 202 ¹ and the first block of quadrant a⁷ _(q1) of input data matrix 204 ⁷ (i.e., w⁷ ₁·a⁷ ₁+w⁷ ₂·a⁷ ₂+w⁷ ₃·a⁷ ₃+w⁷ ₄·a⁷ ₄+w⁷ ₅·a⁷ ₇+w⁷ ₆·a⁷ ₈+w⁷ ₇·a⁷ ₉+w⁷ ₈·a⁷ ₁₀+w⁷ ₉·a⁷ ₁₃+w⁷ ₁₀·a⁷ ₁₄+w⁷ ₁₁·a⁷ ₁₅+w⁷ ₁₂·a⁷ ₁₆+w⁷ ₁₃·a⁷ ₁₉+w⁷ ₁₄·a⁷ ₂₀+w⁷ ₁₅·a⁷ ₂₁+w⁷ ₁₆·a⁷ ₂₂), and the dot product of the eighth weight matrix 202 ¹ ₈ of weight tensor 202 ¹ and the first block of quadrant a⁸ _(q1) of input data matrix 204 ⁸ (i.e., w⁸ ₁·a⁸ ₁+w⁸ ₂·a⁸ ₂+w⁸ ₃·a⁸ ₃+w⁸ ₄·a⁸ ₄+w⁸ ₅·a⁸ ₇+w⁸ ₆·a⁸ ₈+w⁸ ₇·a⁸ ₉+w⁸ ₈·a⁸ ₁₀+w⁸ ₉·a⁸ ₁₃+w⁸ ₁₀·a⁸ ₁₄+w⁸ ₁₁·a⁸ ₁₅+w⁸ ₁₂·a⁸ ₁₆+w⁸ ₁₃·a⁸ ₁₉+w⁸ ₁₄·a⁸ ₂₀+w⁸ ₁₅·a⁸ ₂₁+w⁸ ₁₆·a⁸ ₂₂).

Similarly, output element o² ₁ of output data matrix 206 ² is the sum of the dot products of weight tensor 202 ², i.e., weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and the first block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

And, output element o³ ₁ of output data matrix 206 ³ is the sum of the dot products of weight tensor 202 ³, i.e., weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈, and the first block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

Output element o¹ ₂ of output data matrix 206 ¹ is the sum of the dot products of weight tensor 202 ¹, i.e., weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, and the second block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Generally, the second block of activation elements within the first quadrant a^(i) _(q1) of each input data matrix 204 ^(i) includes a^(i) ₂, a^(i) ₃, a^(i) ₄, a^(i) ₅, a^(i) ₈, a^(i) ₉, a^(i) ₁₀, a^(i) ₁₁, a^(i) ₁₄, a^(i) ₁₅, a^(i) ₁₆, a^(i) ₁₇, a^(i) ₂₀, a^(i) ₂₁, a^(i) ₂₂ and a^(i) ₂₃. For example, the second block of activation elements within the first quadrant a¹ _(q1) of input data matrix 204 ¹ includes a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂ and a¹ ₂₃, the second block of activation elements within the first quadrant a² _(q1) of input data matrix 204 ² includes a² ₂, a² ₃, a² ₄, a² ₅, a² ₈, a² ₉, a² ₁₀, a² ₁₁, a² ₁₄, a² ₁₅, a² ₁₆, a² ₁₇, a² ₂₀, a² ₂₁, a² ₂₂ and a² ₂₃, and so on.

More particularly, the following dot products are summed to generate output element o¹ ₂: the dot product of the first weight matrix 202 ¹ ₁ of weight tensor 202 ¹ and the second block of quadrant a¹ _(q1) of input data matrix 204 ¹ (i.e., w¹ ₁·a¹ ₂+w¹ ₂·a¹ ₃+w¹ ₃·a¹ ₄+w¹ ₄·a¹ ₅+w¹ ₅·a¹ ₈+w¹ ₆·a¹ ₉+w¹ ₇·a¹ ₁₀+w¹ ₈·a¹ ₁₁+w¹ ₉·a¹ ₁₄+w¹ ₁₀·a¹ ₁₅+w¹ ₁₁·a¹ ₁₆+w¹ ₁₂·a¹ ₁₇+w¹ ₁₃·a¹ ₂₀+w¹ ₁₄·a¹ ₂₁+w¹ ₁₅·a¹ ₂₂+w¹ ₁₆·a¹ ₂₃), the dot product of the second weight matrix 202 ¹ ₂ of weight tensor 202 ¹ and the second block of quadrant a² _(q1) of input data matrix 204 ² (i.e., w² ₁·a² ₂+w² ₂·a² ₃+w² ₃·a² ₄+w² ₄·a² ₅+w² ₅·a² ₈+w² ₆·a² ₉+w² ₇·a² ₁₀+w² ₈·a² ₁₁+w² ₉·a² ₁₄+w² ₁₀·a² ₁₅+w² ₁₁·a² ₁₆+w² ₁₂·a² ₁₇+w² ₁₃·a² ₂₀+w² ₁₄·a² ₂₁+w² ₁₅·a² ₂₂+w² ₁₆·a² ₂₃), the dot product of the third weight matrix 202 ¹ ₃ of weight tensor 202 ¹ and the second block of quadrant a³ _(q1) of input data matrix 204 ³ (i.e., w³ ₁·a³ ₂+w³ ₂·a³ ₃+w³ ₃·a³ ₄+w³ ₄·a³ ₅+w³ ₅·a³ ₈+w³ ₆·a³ ₉+w³ ₇·a³ ₁₀+w³ ₈·a³ ₁₁+w³ ₉·a³ ₁₄+w³ ₁₀·a³ ₁₅+w³ ₁₁·a³ ₁₆+w³ ₁₂·a³ ₁₇+w³ ₁₃·a³ ₂₀+w³ ₁₄·a³ ₂₁+w³ ₁₅·a³ ₂₂+w³ ₁₆·a³ ₂₃), the dot product of the fourth weight matrix 202 ¹ ₄ of weight tensor 202 ¹ and the second block of quadrant a⁴ _(q1) of input data matrix 204 ⁴ (i.e., w⁴ ₁·a⁴ ₂+w⁴ ₂·a⁴ ₃+w⁴ ₃·a⁴ ₄+w⁴ ₄·a⁴ ₅+w⁴ ₅·a⁴ ₈+w⁴ ₆·a⁴ ₉+w⁴ ₇·a⁴ ₁₀+w⁴ ₈·a⁴ ₁₁+w⁴ ₉·a⁴ ₁₄+w⁴ ₁₀·a⁴ ₁₅+w⁴ ₁₁·a⁴ ₁₆+w⁴ ₁₂·a⁴ ₁₇+w⁴ ₁₃·a⁴ ₂₀+w⁴ ₁₄·a⁴ ₂₁+w⁴ ₁₅·a⁴ ₂₂+w⁴ ₁₆·a⁴ ₂₃), the dot product of the fifth weight matrix 202 ¹ ₅ of weight tensor 202 ¹ and the second block of quadrant a⁵ _(q1) of input data matrix 204 ⁵ (i.e., w⁵ ₁·a⁵ ₂+w⁵ ₂·a⁵ ₃+w⁵ ₃·a⁵ ₄+w⁵ ₄·a⁵ ₅+w⁵ ₅·a⁵ ₈+w⁵ ₆·a⁵ ₉+w⁵ ₇·a⁵ ₁₀+w⁵ ₈·a⁵ ₁₁+w⁵ ₉·a⁵ ₁₄+w⁵ ₁₀·a⁵ ₁₅+w⁵ ₁₁·a⁵ ₁₆+w⁵ ₁₂·a⁵ ₁₇+w⁵ ₁₃·a⁵ ₂₀+w⁵ ₁₄·a⁵ ₂₁+w⁵ ₁₅·a⁵ ₂₂+w⁵ ₁₆·a⁵ ₂₃), the dot product of the sixth weight matrix 202 ¹ ₆ of weight tensor 202 ¹ and the second block of quadrant a⁶ _(q1) of input data matrix 204 ⁶ (i.e., w⁶ ₁·a⁶ ₂+w⁶ ₂·a⁶ ₃+w⁶ ₃·a⁶ ₄+w⁶ ₄·a⁶ ₅+w⁶ ₅·a⁶ ₈+w⁶ ₆·a⁶ ₉+w⁶ ₇·a⁶ ₁₀+w⁶ ₈·a⁶ ₁₁+w⁶ ₉·a⁶ ₁₄+w⁶ ₁₀·a⁶ ₁₅+w⁶ ₁₁·a⁶ ₁₆+w⁶ ₁₂·a⁶ ₁₇+w⁶ ₁₃·a⁶ ₂₀+w⁶ ₁₄·a⁶ ₂₁+w⁶ ₁₅·a⁶ ₂₂+w⁶ ₁₆·a⁶ ₂₃), the dot product of the seventh weight matrix 202 ¹ ₇ of weight tensor 202 ¹ and the second block of quadrant a⁷ _(q1) of input data matrix 204 ⁷ (i.e., w⁷ ₁·a⁷ ₂+w⁷ ₂·a⁷ ₃+w⁷ ₃·a⁷ ₄+w⁷ ₄·a⁷ ₅+w⁷ ₅·a⁷ ₈+w⁷ ₆·a⁷ ₉+w⁷ ₇·a⁷ ₁₀+w⁷ ₈·a⁷ ₁₁+w⁷ ₉·a⁷ ₁₄+w⁷ ₁₀·a⁷ ₁₅+w⁷ ₁₁·a⁷ ₁₆+w⁷ ₁₂·a⁷ ₁₇+w⁷ ₁₃·a⁷ ₂₀+w⁷ ₁₄·a⁷ ₂₁+w⁷ ₁₅·a⁷ ₂₂+w⁷ ₁₆·a⁷ ₂₃), and the dot product of the eighth weight matrix 202 ¹ ₈ of weight tensor 202 ¹ and the second block of quadrant a⁸ _(q1) of input data matrix 204 ⁸ (i.e., w⁸ ₁·a⁸ ₂+w⁸ ₂·a⁸ ₃+w⁸ ₃·a⁸ ₄+w⁸ ₄·a⁸ ₅+w⁸ ₅·a⁸ ₈+w⁸ ₆·a⁸ ₉+w⁸ ₇·a⁸ ₁₀+w⁸ ₈·a⁸ ₁₁+w⁸ ₉·a⁸ ₁₄+w⁸ ₁₀·a⁸ ₁₅+w⁸ ₁₁·a⁸ ₁₆+w⁸ ₁₂·a⁸ ₁₇+w⁸ ₁₃·a⁸ ₂₀+w⁸ ₁₄·a⁸ ₂₁+w⁸ ₁₅·a⁸ ₂₂+w⁸ ₁₆·a⁸ ₂₃).

Similarly, output element o² ₂ of output data matrix 206 ² is the sum of the dot products of weight tensor 202 ², i.e., weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and the second block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

And, output element o³ ₂ of output data matrix 206 ³ is the sum of the dot products of weight tensor 202 ³, i.e., weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈, and the second block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

Output element o¹ ₃ of output data matrix 206 ¹ is the sum of the dot products of weight tensor 202 ¹, i.e., weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, and the third block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Generally, the third block of activation elements within the first quadrant a^(i) _(q1) of each input data matrix 204 ^(i) includes a^(i) ₃, a^(i) ₄, a^(i) ₅, a^(i) ₆, a^(i) ₉, a^(i) ₁₀, a^(i) ₁₁, a^(i) ₁₂, a^(i) ₁₅, a^(i) ₁₆, a^(i) ₁₇, a^(i) ₁₈, a^(i) ₂₁, a^(i) ₂₂, a^(i) ₂₃ and a^(i) ₂₄. For example, the third block of activation elements within the first quadrant a¹ _(q1) of input data matrix 204 ¹ includes a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₆, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₂, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃ and a¹ ₂₄, the third block of activation elements within the first quadrant a² _(q1) of input data matrix 204 ² includes a² ₃, a² ₄, a² ₅, a² ₆, a² ₉, a² ₁₀, a² ₁₁, a² ₁₂, a² ₁₅, a² ₁₆, a² ₁₇, a² ₁₈, a² ₂₁, a² ₂₂, a² ₂₃ and a² ₂₄, and so on.
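By way of illustration only, the block element indices listed above may be generated programmatically. The following Python sketch assumes that each input data matrix is six elements wide and that each weight matrix is 4×4, which is consistent with the indices recited above; the names INPUT_WIDTH, FILTER_SIZE and block_indices are hypothetical, and the mapping of quadrant number to a vertical offset is likewise an assumption.

```python
# Assumed geometry: 6-element-wide input data matrices and 4x4 weight matrices.
INPUT_WIDTH = 6
FILTER_SIZE = 4


def block_indices(quadrant, block):
    """Return the 1-based activation indices covered by one block.

    quadrant: 1-based quadrant number (assumed to be the vertical window offset).
    block:    1-based block number within the quadrant (horizontal window offset).
    """
    top = quadrant - 1
    left = block - 1
    return [
        (top + r) * INPUT_WIDTH + (left + c) + 1
        for r in range(FILTER_SIZE)
        for c in range(FILTER_SIZE)
    ]


# First, second and third blocks of the first quadrant.
for block in (1, 2, 3):
    print(block, block_indices(1, block))
```

Under these assumptions, the three printed lists match the first, second and third blocks of the first quadrant described above.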

More particularly, the following dot products are summed to generate output element o¹ ₃: the dot product of the first weight matrix 202 ¹ ₁ of weight tensor 202 ¹ and the third block of quadrant a¹ _(q1) of input data matrix 204 ¹ (i.e., w¹ ₁·a¹ ₃+w¹ ₂·a¹ ₄+w¹ ₃·a¹ ₅+w¹ ₄·a¹ ₆+w¹ ₅·a¹ ₉+w¹ ₆·a¹ ₁₀+w¹ ₇·a¹ ₁₁+w¹ ₈·a¹ ₁₂+w¹ ₉·a¹ ₁₅+w¹ ₁₀·a¹ ₁₆+w¹ ₁₁·a¹ ₁₇+w¹ ₁₂·a¹ ₁₈+w¹ ₁₃·a¹ ₂₁+w¹ ₁₄·a¹ ₂₂+w¹ ₁₅·a¹ ₂₃+w¹ ₁₆·a¹ ₂₄), the dot product of the second weight matrix 202 ¹ ₂ of weight tensor 202 ¹ and the third block of quadrant a² _(q1) of input data matrix 204 ² (i.e., w² ₁·a² ₃+w² ₂·a² ₄+w² ₃·a² ₅+w² ₄·a² ₆+w² ₅·a² ₉+w² ₆·a² ₁₀+w² ₇·a² ₁₁+w² ₈·a² ₁₂+w² ₉·a² ₁₅+w² ₁₀·a² ₁₆+w² ₁₁·a² ₁₇+w² ₁₂·a² ₁₈+w² ₁₃·a² ₂₁+w² ₁₄·a² ₂₂+w² ₁₅·a² ₂₃+w² ₁₆·a² ₂₄), the dot product of the third weight matrix 202 ¹ ₃ of weight tensor 202 ¹ and the third block of quadrant a³ _(q1) of input data matrix 204 ³ (i.e., w³ ₁·a³ ₃+w³ ₂·a³ ₄+w³ ₃·a³ ₅+w³ ₄·a³ ₆+w³ ₅·a³ ₉+w³ ₆·a³ ₁₀+w³ ₇·a³ ₁₁+w³ ₈·a³ ₁₂+w³ ₉·a³ ₁₅+w³ ₁₀·a³ ₁₆+w³ ₁₁·a³ ₁₇+w³ ₁₂·a³ ₁₈+w³ ₁₃·a³ ₂₁+w³ ₁₄·a³ ₂₂+w³ ₁₅·a³ ₂₃+w³ ₁₆·a³ ₂₄), the dot product of the fourth weight matrix 202 ¹ ₄ of weight tensor 202 ¹ and the third block of quadrant a⁴ _(q1) of input data matrix 204 ⁴ (i.e., w⁴ ₁·a⁴ ₃+w⁴ ₂·a⁴ ₄+w⁴ ₃·a⁴ ₅+w⁴ ₄·a⁴ ₆+w⁴ ₅·a⁴ ₉+w⁴ ₆·a⁴ ₁₀+w⁴ ₇·a⁴ ₁₁+w⁴ ₈·a⁴ ₁₂+w⁴ ₉·a⁴ ₁₅+w⁴ ₁₀·a⁴ ₁₆+w⁴ ₁₁·a⁴ ₁₇+w⁴ ₁₂·a⁴ ₁₈+w⁴ ₁₃·a⁴ ₂₁+w⁴ ₁₄·a⁴ ₂₂+w⁴ ₁₅·a⁴ ₂₃+w⁴ ₁₆·a⁴ ₂₄), the dot product of the fifth weight matrix 202 ¹ ₅ of weight tensor 202 ¹ and the third block of quadrant a⁵ _(q1) of input data matrix 204 ⁵ (i.e., w⁵ ₁·a⁵ ₃+w⁵ ₂·a⁵ ₄+w⁵ ₃·a⁵ ₅+w⁵ ₄·a⁵ ₆+w⁵ ₅·a⁵ ₉+w⁵ ₆·a⁵ ₁₀+w⁵ ₇·a⁵ ₁₁+w⁵ ₈·a⁵ ₁₂+w⁵ ₉·a⁵ ₁₅+w⁵ ₁₀·a⁵ ₁₆+w⁵ ₁₁·a⁵ ₁₇+w⁵ ₁₂·a⁵ ₁₈+w⁵ ₁₃·a⁵ ₂₁+w⁵ ₁₄·a⁵ ₂₂+w⁵ ₁₅·a⁵ ₂₃+w⁵ ₁₆·a⁵ ₂₄), the dot product of the sixth weight matrix 202 ¹ ₆ of weight tensor 202 ¹ and the third block of quadrant a⁶ _(q1) of input data matrix 204 ⁶ (i.e., w⁶ ₁·a⁶ ₃+w⁶ ₂·a⁶ ₄+w⁶ ₃·a⁶ ₅+w⁶ ₄·a⁶ ₆+w⁶ ₅·a⁶ ₉+w⁶ ₆·a⁶ ₁₀+w⁶ ₇·a⁶ ₁₁+w⁶ ₈·a⁶ ₁₂+w⁶ ₉·a⁶ ₁₅+w⁶ ₁₀·a⁶ ₁₆+w⁶ ₁₁·a⁶ ₁₇+w⁶ ₁₂·a⁶ ₁₈+w⁶ ₁₃·a⁶ ₂₁+w⁶ ₁₄·a⁶ ₂₂+w⁶ ₁₅·a⁶ ₂₃+w⁶ ₁₆·a⁶ ₂₄), the dot product of the seventh weight matrix 202 ¹ ₇ of weight tensor 202 ¹ and the third block of quadrant a⁷ _(q1) of input data matrix 204 ⁷ (i.e., w⁷ ₁·a⁷ ₃+w⁷ ₂·a⁷ ₄+w⁷ ₃·a⁷ ₅+w⁷ ₄·a⁷ ₆+w⁷ ₅·a⁷ ₉+w⁷ ₆·a⁷ ₁₀+w⁷ ₇·a⁷ ₁₁+w⁷ ₈·a⁷ ₁₂+w⁷ ₉·a⁷ ₁₅+w⁷ ₁₀·a⁷ ₁₆+w⁷ ₁₁·a⁷ ₁₇+w⁷ ₁₂·a⁷ ₁₈+w⁷ ₁₃·a⁷ ₂₁+w⁷ ₁₄·a⁷ ₂₂+w⁷ ₁₅·a⁷ ₂₃+w⁷ ₁₆·a⁷ ₂₄), and the dot product of the eighth weight matrix 202 ¹ ₈ of weight tensor 202 ¹ and the third block of quadrant a⁸ _(q1) of input data matrix 204 ⁸ (i.e., w⁸ ₁·a⁸ ₃+w⁸ ₂·a⁸ ₄+w⁸ ₃·a⁸ ₅+w⁸ ₄·a⁸ ₆+w⁸ ₅·a⁸ ₉+w⁸ ₆·a⁸ ₁₀+w⁸ ₇·a⁸ ₁₁+w⁸ ₈·a⁸ ₁₂+w⁸ ₉·a⁸ ₁₅+w⁸ ₁₀·a⁸ ₁₆+w⁸ ₁₁·a⁸ ₁₇+w⁸ ₁₂·a⁸ ₁₈+w⁸ ₁₃·a⁸ ₂₁+w⁸ ₁₄·a⁸ ₂₂+w⁸ ₁₅·a⁸ ₂₃+w⁸ ₁₆·a⁸ ₂₄).

Similarly, output element o² ₃ of output data matrix 206 ² is the sum of the dot products of weight tensor 202 ², i.e., weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and the third block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

And, output element o³ ₃ of output data matrix 206 ³ is the sum of the dot products of weight tensor 202 ³, i.e., weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈, and the third block of activation elements within first quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

The calculation of the output elements in second quadrants o¹ _(q2), o² _(q2) and o³ _(q2) follows.

Output element o¹ ₄ of output data matrix 206 ¹ is the sum of the dot products of weight tensor 202 ¹, i.e., weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, and the first block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Similarly, output element o² ₄ of output data matrix 206 ² is the sum of the dot products of weight tensor 202 ², i.e., weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and the first block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. And, output element o³ ₄ of output data matrix 206 ³ is the sum of the dot products of weight tensor 202 ³, i.e., weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈, and the first block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

Output element o¹ ₅ of output data matrix 206 ¹ is the sum of the dot products of weight tensor 202 ¹, i.e., weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, and the second block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Similarly, output element o² ₅ of output data matrix 206 ² is the sum of the dot products of weight tensor 202 ², i.e., weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and the second block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

And, output element o³ ₅ of output data matrix 206 ³ is the sum of the dot products of weight tensor 202 ³, i.e., weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈, and the second block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

Output element o¹ ₆ of output data matrix 206 ¹ is the sum of the dot products of weight tensor 202 ¹, i.e., weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, and the third block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Output element o² ₆ of output data matrix 206 ² is the sum of the dot products of weight tensor 202 ², i.e., weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and the third block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Output element o³ ₆ of output data matrix 206 ³ is the sum of the dot products of weight tensor 202 ³, i.e., weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈, and the third block of activation elements within second quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

The calculation of the output elements in third quadrants o¹ _(q3), o² _(q3) and o³ _(q3) follows.

Output element o¹ ₇ of output data matrix 206 ¹ is the sum of the dot products of weight tensor 202 ¹, i.e., weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, and the first block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Output element o² ₇ of output data matrix 206 ² is the sum of the dot products of weight tensor 202 ², i.e., weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and the first block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. And, output element o³ ₇ of output data matrix 206 ³ is the sum of the dot products of weight tensor 202 ³, i.e., weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈, and the first block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

Output element o¹ ₈ of output data matrix 206 ¹ is the sum of the dot products of weight tensor 202 ¹, i.e., weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, and the second block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Similarly, output element o² ₈ of output data matrix 206 ² is the sum of the dot products of weight tensor 202 ², i.e., weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and the second block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. And, output element o³ ₈ of output data matrix 206 ³ is the sum of the dot products of weight tensor 202 ³, i.e., weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈, and the second block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.

Output element o¹ ₉ of output data matrix 206 ¹ is the sum of the dot products of weight tensor 202 ¹, i.e., weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, and the third block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Output element o² ₉ of output data matrix 206 ² is the sum of the dot products of weight tensor 202 ², i.e., weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and the third block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively. Output element o³ ₉ of output data matrix 206 ³ is the sum of the dot products of weight tensor 202 ³, i.e., weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈, and the third block of activation elements within third quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3) of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷ and 204 ⁸, respectively.
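By way of illustration only, the per-element calculation described above may be sketched in Python as follows. The sketch assumes 6×6 input data matrices stored in row-major order, 4×4 weight matrices, and that the quadrant number selects the vertical position of the block; the function names and the sample data are hypothetical and are not part of the disclosure.

```python
def conv_output_element(weight_tensor, input_matrices, quadrant, block,
                        width=6, size=4):
    """Sum, over all channels, the dot product of a weight matrix and a block.

    weight_tensor:  one flattened 4x4 weight matrix (16 values) per channel.
    input_matrices: one flattened, row-major input data matrix per channel.
    quadrant/block: 1-based vertical/horizontal block position (assumed mapping).
    """
    top, left = quadrant - 1, block - 1
    total = 0
    for weights, matrix in zip(weight_tensor, input_matrices):
        for r in range(size):
            for c in range(size):
                total += weights[r * size + c] * matrix[(top + r) * width + left + c]
    return total


def conv_output_matrix(weight_tensor, input_matrices, positions=3):
    """Return the nine output elements o_1..o_9 produced by one weight tensor."""
    return [
        conv_output_element(weight_tensor, input_matrices, quadrant, block)
        for quadrant in range(1, positions + 1)
        for block in range(1, positions + 1)
    ]


# Hypothetical sample data: 8 channels, every weight 1, activations 1..36.
sample_weights = [[1] * 16 for _ in range(8)]
sample_inputs = [list(range(1, 37)) for _ in range(8)]
print(conv_output_matrix(sample_weights, sample_inputs))
```

With all weights equal to one, each output element is simply eight times the sum of the activation values in the corresponding block, which makes the result easy to check by hand.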

FIG. 2B depicts converted convolutional layer calculation 210 for a CNN, in accordance with an embodiment of the present disclosure.

In one embodiment, the convolutional layer calculations for CNNs executing on central processing units (CPUs), GPUs, NPUs, etc. may be converted into generic matrix multiplication (GEMM) operations, which may leverage GEMM-optimized software libraries. Convolutional layer calculation 200 is converted into a GEMM operation by converting filter 202 into converted weight matrix 212, converting input data tensor 204 into converted input data matrix 214, and then multiplying converted weight matrix 212 and converted input data matrix 214 to generate converted output data matrix 216. Because simple matrix multiplication is performed rather than a convolution operation, each output element within converted output data matrix 216 is the dot product of one row of converted weight matrix 212 and one column of converted input data matrix 214. Converted output data matrix 216 is then reformed into output data tensor 206.

Converted weight matrix 212 is a 3×128 matrix, and includes converted weight tensors 212 ¹, 212 ² and 212 ³. Weight tensor 202 ¹ is flattened to form converted weight tensor 212 ¹, i.e., the first row, weight tensor 202 ² is flattened to form converted weight tensor 212 ², i.e., the second row, and weight tensor 202 ³ is flattened to form converted weight tensor 212 ³, i.e., the third row.

The first row of converted weight matrix 212 includes weights w¹ ₁, w¹ ₂, w¹ ₃, w¹ ₄, w¹ ₅, w¹ ₆, w¹ ₇, w¹ ₈, w¹ ₉, w¹ ₁₀, w¹ ₁₁, w¹ ₁₂, w¹ ₁₃, w¹ ₁₄, w¹ ₁₅, w¹ ₁₆, w² ₁, w² ₂, w² ₃, w² ₄, w² ₅, w² ₆, w² ₇, w² ₈, w² ₉, w² ₁₀, w² ₁₁, w² ₁₂, w² ₁₃, w² ₁₄, w² ₁₅, w² ₁₆, w³ ₁, w³ ₂, w³ ₃, w³ ₄, w³ ₅, w³ ₆, w³ ₇, w³ ₈, w³ ₉, w³ ₁₀, w³ ₁₁, w³ ₁₂, w³ ₁₃, w³ ₁₄, w³ ₁₅, w³ ₁₆, w⁴ ₁, w⁴ ₂, w⁴ ₃, w⁴ ₄, w⁴ ₅, w⁴ ₆, w⁴ ₇, w⁴ ₈, w⁴ ₉, w⁴ ₁₀, w⁴ ₁₁, w⁴ ₁₂, w⁴ ₁₃, w⁴ ₁₄, w⁴ ₁₅, w⁴ ₁₆, w⁵ ₁, w⁵ ₂, w⁵ ₃, w⁵ ₄, w⁵ ₅, w⁵ ₆, w⁵ ₇, w⁵ ₈, w⁵ ₉, w⁵ ₁₀, w⁵ ₁₁, w⁵ ₁₂, w⁵ ₁₃, w⁵ ₁₄, w⁵ ₁₅, w⁵ ₁₆, w⁶ ₁, w⁶ ₂, w⁶ ₃, w⁶ ₄, w⁶ ₅, w⁶ ₆, w⁶ ₇, w⁶ ₈, w⁶ ₉, w⁶ ₁₀, w⁶ ₁₁, w⁶ ₁₂, w⁶ ₁₃, w⁶ ₁₄, w⁶ ₁₅, w⁶ ₁₆, w⁷ ₁, w⁷ ₂, w⁷ ₃, w⁷ ₄, w⁷ ₅, w⁷ ₆, w⁷ ₇, w⁷ ₈, w⁷ ₉, w⁷ ₁₀, w⁷ ₁₁, w⁷ ₁₂, w⁷ ₁₃, w⁷ ₁₄, w⁷ ₁₅, w⁷ ₁₆, w⁸ ₁, w⁸ ₂, w⁸ ₃, w⁸ ₄, w⁸ ₅, w⁸ ₆, w⁸ ₇, w⁸ ₈, w⁸ ₉, w⁸ ₁₀, w⁸ ₁₁, w⁸ ₁₂, w⁸ ₁₃, w⁸ ₁₄, w⁸ ₁₅ and w⁸ ₁₆.

The second row of converted weight matrix 212 includes weights x¹ ₁, x¹ ₂, x¹ ₃, x¹ ₄, x¹ ₅, x¹ ₆, x¹ ₇, x¹ ₈, x¹ ₉, x¹ ₁₀, x¹ ₁₁, x¹ ₁₂, x¹ ₁₃, x¹ ₁₄, x¹ ₁₅, x¹ ₁₆, x² ₁, x² ₂, x² ₃, x² ₄, x² ₅, x² ₆, x² ₇, x² ₈, x² ₉, x² ₁₀, x² ₁₁, x² ₁₂, x² ₁₃, x² ₁₄, x² ₁₅, x² ₁₆, x³ ₁, x³ ₂, x³ ₃, x³ ₄, x³ ₅, x³ ₆, x³ ₇, x³ ₈, x³ ₉, x³ ₁₀, x³ ₁₁, x³ ₁₂, x³ ₁₃, x³ ₁₄, x³ ₁₅, x³ ₁₆, x⁴ ₁, x⁴ ₂, x⁴ ₃, x⁴ ₄, x⁴ ₅, x⁴ ₆, x⁴ ₇, x⁴ ₈, x⁴ ₉, x⁴ ₁₀, x⁴ ₁₁, x⁴ ₁₂, x⁴ ₁₃, x⁴ ₁₄, x⁴ ₁₅, x⁴ ₁₆, x⁵ ₁, x⁵ ₂, x⁵ ₃, x⁵ ₄, x⁵ ₅, x⁵ ₆, x⁵ ₇, x⁵ ₈, x⁵ ₉, x⁵ ₁₀, x⁵ ₁₁, x⁵ ₁₂, x⁵ ₁₃, x⁵ ₁₄, x⁵ ₁₅, x⁵ ₁₆, x⁶ ₁, x⁶ ₂, x⁶ ₃, x⁶ ₄, x⁶ ₅, x⁶ ₆, x⁶ ₇, x⁶ ₈, x⁶ ₉, x⁶ ₁₀, x⁶ ₁₁, x⁶ ₁₂, x⁶ ₁₃, x⁶ ₁₄, x⁶ ₁₅, x⁶ ₁₆, x⁷ ₁, x⁷ ₂, x⁷ ₃, x⁷ ₄, x⁷ ₅, x⁷ ₆, x⁷ ₇, x⁷ ₈, x⁷ ₉, x⁷ ₁₀, x⁷ ₁₁, x⁷ ₁₂, x⁷ ₁₃, x⁷ ₁₄, x⁷ ₁₅, x⁷ ₁₆, x⁸ ₁, x⁸ ₂, x⁸ ₃, x⁸ ₄, x⁸ ₅, x⁸ ₆, x⁸ ₇, x⁸ ₈, x⁸ ₉, x⁸ ₁₀, x⁸ ₁₁, x⁸ ₁₂, x⁸ ₁₃, x⁸ ₁₄, x⁸ ₁₅ and x⁸ ₁₆.

The third row of converted weight matrix 212 includes weights y¹ ₁, y¹ ₂, y¹ ₃, y¹ ₄, y¹ ₅, y¹ ₆, y¹ ₇, y¹ ₈, y¹ ₉, y¹ ₁₀, y¹ ₁₁, y¹ ₁₂, y¹ ₁₃, y¹ ₁₄, y¹ ₁₅, y¹ ₁₆, y² ₁, y² ₂, y² ₃, y² ₄, y² ₅, y² ₆, y² ₇, y² ₈, y² ₉, y² ₁₀, y² ₁₁, y² ₁₂, y² ₁₃, y² ₁₄, y² ₁₅, y² ₁₆, y³ ₁, y³ ₂, y³ ₃, y³ ₄, y³ ₅, y³ ₆, y³ ₇, y³ ₈, y³ ₉, y³ ₁₀, y³ ₁₁, y³ ₁₂, y³ ₁₃, y³ ₁₄, y³ ₁₅, y³ ₁₆, y⁴ ₁, y⁴ ₂, y⁴ ₃, y⁴ ₄, y⁴ ₅, y⁴ ₆, y⁴ ₇, y⁴ ₈, y⁴ ₉, y⁴ ₁₀, y⁴ ₁₁, y⁴ ₁₂, y⁴ ₁₃, y⁴ ₁₄, y⁴ ₁₅, y⁴ ₁₆, y⁵ ₁, y⁵ ₂, y⁵ ₃, y⁵ ₄, y⁵ ₅, y⁵ ₆, y⁵ ₇, y⁵ ₈, y⁵ ₉, y⁵ ₁₀, y⁵ ₁₁, y⁵ ₁₂, y⁵ ₁₃, y⁵ ₁₄, y⁵ ₁₅, y⁵ ₁₆, y⁶ ₁, y⁶ ₂, y⁶ ₃, y⁶ ₄, y⁶ ₅, y⁶ ₆, y⁶ ₇, y⁶ ₈, y⁶ ₉, y⁶ ₁₀, y⁶ ₁₁, y⁶ ₁₂, y⁶ ₁₃, y⁶ ₁₄, y⁶ ₁₅, y⁶ ₁₆, y⁷ ₁, y⁷ ₂, y⁷ ₃, y⁷ ₄, y⁷ ₅, y⁷ ₆, y⁷ ₇, y⁷ ₈, y⁷ ₉, y⁷ ₁₀, y⁷ ₁₁, y⁷ ₁₂, y⁷ ₁₃, y⁷ ₁₄, y⁷ ₁₅, y⁷ ₁₆, y⁸ ₁, y⁸ ₂, y⁸ ₃, y⁸ ₄, y⁸ ₅, y⁸ ₆, y⁸ ₇, y⁸ ₈, y⁸ ₉, y⁸ ₁₀, y⁸ ₁₁, y⁸ ₁₂, y⁸ ₁₃, y⁸ ₁₄, y⁸ ₁₅ and y⁸ ₁₆.
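By way of illustration only, the flattening of each weight tensor into one row of the converted weight matrix may be sketched in Python as follows. The string labels (e.g., "w1_1" for the first weight of the first weight matrix) and the function names are hypothetical and are used only to make the resulting element order visible; eight 4×4 weight matrices per weight tensor are assumed, consistent with the 3×128 size recited above.

```python
def labelled_tensor(symbol, channels=8, size=4):
    """Build one weight tensor whose entries are labels such as 'w1_1'."""
    return [
        [[f"{symbol}{ch}_{r * size + c + 1}" for c in range(size)]
         for r in range(size)]
        for ch in range(1, channels + 1)
    ]


def flatten_weight_tensor(weight_tensor):
    """Concatenate the weight matrices of one tensor, row by row, into one row."""
    row = []
    for weight_matrix in weight_tensor:   # one 4x4 matrix per channel
        for weight_row in weight_matrix:  # four rows per matrix
            row.extend(weight_row)        # four weights per row
    return row


# One row per weight tensor: w (first), x (second) and y (third).
converted_weight_matrix = [
    flatten_weight_tensor(labelled_tensor(symbol)) for symbol in ("w", "x", "y")
]
print(len(converted_weight_matrix), len(converted_weight_matrix[0]))
print(converted_weight_matrix[0][:18])
```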

Converted input data matrix 214 is a 128×9 matrix, and includes the blocks of each quadrant of input data matrices 204 ¹, 204 ², 204 ³, 204 ⁴, 204 ⁵, 204 ⁶, 204 ⁷, 204 ⁸, i.e., quadrants a¹ _(q1), a¹ _(q2), a¹ _(q3), a² _(q1), a² _(q2), a² _(q3), a³ _(q1), a³ _(q2), a³ _(q3), a⁴ _(q1), a⁴ _(q2), a⁴ _(q3), a⁵ _(q1), a⁵ _(q2), a⁵ _(q3), a⁶ _(q1), a⁶ _(q2), a⁶ _(q3), a⁷ _(q1), a⁷ _(q2), a⁷ _(q3), a⁸ _(q1), a⁸ _(q2), and a⁸ _(q3), respectively. Generally, each block is flattened to form a portion of a single column of converted input data matrix 214.

More particularly, the first column of converted input data matrix 214 includes the first blocks from quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1), the second column of converted input data matrix 214 includes the second blocks from quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1), and the third column of converted input data matrix 214 includes the third blocks from quadrants a¹ _(q1), a² _(q1), a³ _(q1), a⁴ _(q1), a⁵ _(q1), a⁶ _(q1), a⁷ _(q1) and a⁸ _(q1).

For example, the first column of converted input data matrix 214 includes activations a¹ ₁, a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₇, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₃, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₉, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a² ₁, a² ₂, a² ₃, a² ₄, a² ₇, a² ₈, a² ₉, a² ₁₀, a² ₁₃, a² ₁₄, a² ₁₅, a² ₁₆, a² ₁₉, a² ₂₀, a² ₂₁, a² ₂₂, a³ ₁, a³ ₂, a³ ₃, a³ ₄, a³ ₇, a³ ₈, a³ ₉, a³ ₁₀, a³ ₁₃, a³ ₁₄, a³ ₁₅, a³ ₁₆, a³ ₁₉, a³ ₂₀, a³ ₂₁, a³ ₂₂, a⁴ ₁, a⁴ ₂, a⁴ ₃, a⁴ ₄, a⁴ ₇, a⁴ ₈, a⁴ ₉, a⁴ ₁₀, a⁴ ₁₃, a⁴ ₁₄, a⁴ ₁₅, a⁴ ₁₆, a⁴ ₁₉, a⁴ ₂₀, a⁴ ₂₁, a⁴ ₂₂, a⁵ ₁, a⁵ ₂, a⁵ ₃, a⁵ ₄, a⁵ ₇, a⁵ ₈, a⁵ ₉, a⁵ ₁₀, a⁵ ₁₃, a⁵ ₁₄, a⁵ ₁₅, a⁵ ₁₆, a⁵ ₁₉, a⁵ ₂₀, a⁵ ₂₁, a⁵ ₂₂, a⁶ ₁, a⁶ ₂, a⁶ ₃, a⁶ ₄, a⁶ ₇, a⁶ ₈, a⁶ ₉, a⁶ ₁₀, a⁶ ₁₃, a⁶ ₁₄, a⁶ ₁₅, a⁶ ₁₆, a⁶ ₁₉, a⁶ ₂₀, a⁶ ₂₁, a⁶ ₂₂, a⁷ ₁, a⁷ ₂, a⁷ ₃, a⁷ ₄, a⁷ ₇, a⁷ ₈, a⁷ ₉, a⁷ ₁₀, a⁷ ₁₃, a⁷ ₁₄, a⁷ ₁₅, a⁷ ₁₆, a⁷ ₁₉, a⁷ ₂₀, a⁷ ₂₁, a⁷ ₂₂, a⁸ ₁, a⁸ ₂, a⁸ ₃, a⁸ ₄, a⁸ ₇, a⁸ ₈, a⁸ ₉, a⁸ ₁₀, a⁸ ₁₃, a⁸ ₁₄, a⁸ ₁₅, a⁸ ₁₆, a⁸ ₁₉, a⁸ ₂₀, a⁸ ₂₁ and a⁸ ₂₂.

The second column of converted input data matrix 214 includes activations a¹ ₂, a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₈, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₄, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₂₀, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a² ₂, a² ₃, a² ₄, a² ₅, a² ₈, a² ₉, a² ₁₀, a² ₁₁, a² ₁₄, a² ₁₅, a² ₁₆, a² ₁₇, a² ₂₀, a² ₂₁, a² ₂₂, a² ₂₃, a³ ₂, a³ ₃, a³ ₄, a³ ₅, a³ ₈, a³ ₉, a³ ₁₀, a³ ₁₁, a³ ₁₄, a³ ₁₅, a³ ₁₆, a³ ₁₇, a³ ₂₀, a³ ₂₁, a³ ₂₂, a³ ₂₃, a⁴ ₂, a⁴ ₃, a⁴ ₄, a⁴ ₅, a⁴ ₈, a⁴ ₉, a⁴ ₁₀, a⁴ ₁₁, a⁴ ₁₄, a⁴ ₁₅, a⁴ ₁₆, a⁴ ₁₇, a⁴ ₂₀, a⁴ ₂₁, a⁴ ₂₂, a⁴ ₂₃, a⁵ ₂, a⁵ ₃, a⁵ ₄, a⁵ ₅, a⁵ ₈, a⁵ ₉, a⁵ ₁₀, a⁵ ₁₁, a⁵ ₁₄, a⁵ ₁₅, a⁵ ₁₆, a⁵ ₁₇, a⁵ ₂₀, a⁵ ₂₁, a⁵ ₂₂, a⁵ ₂₃, a⁶ ₂, a⁶ ₃, a⁶ ₄, a⁶ ₅, a⁶ ₈, a⁶ ₉, a⁶ ₁₀, a⁶ ₁₁, a⁶ ₁₄, a⁶ ₁₅, a⁶ ₁₆, a⁶ ₁₇, a⁶ ₂₀, a⁶ ₂₁, a⁶ ₂₂, a⁶ ₂₃, a⁷ ₂, a⁷ ₃, a⁷ ₄, a⁷ ₅, a⁷ ₈, a⁷ ₉, a⁷ ₁₀, a⁷ ₁₁, a⁷ ₁₄, a⁷ ₁₅, a⁷ ₁₆, a⁷ ₁₇, a⁷ ₂₀, a⁷ ₂₁, a⁷ ₂₂, a⁷ ₂₃, a⁸ ₂, a⁸ ₃, a⁸ ₄, a⁸ ₅, a⁸ ₈, a⁸ ₉, a⁸ ₁₀, a⁸ ₁₁, a⁸ ₁₄, a⁸ ₁₅, a⁸ ₁₆, a⁸ ₁₇, a⁸ ₂₀, a⁸ ₂₁, a⁸ ₂₂ and a⁸ ₂₃.

The third column of converted input data matrix 214 includes activations a¹ ₃, a¹ ₄, a¹ ₅, a¹ ₆, a¹ ₉, a¹ ₁₀, a¹ ₁₁, a¹ ₁₂, a¹ ₁₅, a¹ ₁₆, a¹ ₁₇, a¹ ₁₈, a¹ ₂₁, a¹ ₂₂, a¹ ₂₃, a¹ ₂₄, a² ₃, a² ₄, a² ₅, a² ₆, a² ₉, a² ₁₀, a² ₁₁, a² ₁₂, a² ₁₅, a² ₁₆, a² ₁₇, a² ₁₈, a² ₂₁, a² ₂₂, a² ₂₃, a² ₂₄, a³ ₃, a³ ₄, a³ ₅, a³ ₆, a³ ₉, a³ ₁₀, a³ ₁₁, a³ ₁₂, a³ ₁₅, a³ ₁₆, a³ ₁₇, a³ ₁₈, a³ ₂₁, a³ ₂₂, a³ ₂₃, a³ ₂₄, a⁴ ₃, a⁴ ₄, a⁴ ₅, a⁴ ₆, a⁴ ₉, a⁴ ₁₀, a⁴ ₁₁, a⁴ ₁₂, a⁴ ₁₅, a⁴ ₁₆, a⁴ ₁₇, a⁴ ₁₈, a⁴ ₂₁, a⁴ ₂₂, a⁴ ₂₃, a⁴ ₂₄, a⁵ ₃, a⁵ ₄, a⁵ ₅, a⁵ ₆, a⁵ ₉, a⁵ ₁₀, a⁵ ₁₁, a⁵ ₁₂, a⁵ ₁₅, a⁵ ₁₆, a⁵ ₁₇, a⁵ ₁₈, a⁵ ₂₁, a⁵ ₂₂, a⁵ ₂₃, a⁵ ₂₄, a⁶ ₃, a⁶ ₄, a⁶ ₅, a⁶ ₆, a⁶ ₉, a⁶ ₁₀, a⁶ ₁₁, a⁶ ₁₂, a⁶ ₁₅, a⁶ ₁₆, a⁶ ₁₇, a⁶ ₁₈, a⁶ ₂₁, a⁶ ₂₂, a⁶ ₂₃, a⁶ ₂₄, a⁷ ₃, a⁷ ₄, a⁷ ₅, a⁷ ₆, a⁷ ₉, a⁷ ₁₀, a⁷ ₁₁, a⁷ ₁₂, a⁷ ₁₅, a⁷ ₁₆, a⁷ ₁₇, a⁷ ₁₈, a⁷ ₂₁, a⁷ ₂₂, a⁷ ₂₃, a⁷ ₂₄, a⁸ ₃, a⁸ ₄, a⁸ ₅, a⁸ ₆, a⁸ ₉, a⁸ ₁₀, a⁸ ₁₁, a⁸ ₁₂, a⁸ ₁₅, a⁸ ₁₆, a⁸ ₁₇, a⁸ ₁₈, a⁸ ₂₁, a⁸ ₂₂, a⁸ ₂₃ and a⁸ ₂₄.

The remaining columns of converted input data matrix 214 are formed in a similar manner. The fourth column of converted input data matrix 214 includes the first blocks from quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2), the fifth column of converted input data matrix 214 includes the second blocks from quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2), and the sixth column of converted input data matrix 214 includes the third blocks from quadrants a¹ _(q2), a² _(q2), a³ _(q2), a⁴ _(q2), a⁵ _(q2), a⁶ _(q2), a⁷ _(q2) and a⁸ _(q2). The seventh column of converted input data matrix 214 includes the first blocks from quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3), the eighth column of converted input data matrix 214 includes the second blocks from quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3), and the ninth column of converted input data matrix 214 includes the third blocks from quadrants a¹ _(q3), a² _(q3), a³ _(q3), a⁴ _(q3), a⁵ _(q3), a⁶ _(q3), a⁷ _(q3) and a⁸ _(q3).
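By way of illustration only, the formation of the converted input data matrix may be sketched in Python as follows. The sketch assumes eight 6×6 input data matrices stored in row-major order, a 4×4 block size, and that successive quadrants correspond to successive vertical block positions; the function name im2col and the string labels (e.g., "a1_7") are hypothetical.

```python
def im2col(input_matrices, width=6, size=4):
    """Arrange every block of every channel into columns of one matrix.

    Each column holds the flattened blocks at one position, concatenated over
    all channels; columns are ordered by quadrant, then by block.
    """
    height = len(input_matrices[0]) // width
    columns = []
    for top in range(height - size + 1):        # quadrant (assumed vertical offset)
        for left in range(width - size + 1):    # block within the quadrant
            column = []
            for matrix in input_matrices:       # one input data matrix per channel
                for r in range(size):
                    for c in range(size):
                        column.append(matrix[(top + r) * width + left + c])
            columns.append(column)
    # Transpose so that the result has one row per weight and one column per block.
    return [list(row) for row in zip(*columns)]


# Labelled activations: 'a1_1' .. 'a1_36' for the first channel, and so on.
labelled_inputs = [[f"a{ch}_{i}" for i in range(1, 37)] for ch in range(1, 9)]
converted_input = im2col(labelled_inputs)
first_column = [row[0] for row in converted_input]
print(len(converted_input), len(converted_input[0]))
print(first_column[:16])
```

Under these assumptions the result has 128 rows and 9 columns, and the first sixteen entries of the first column are the first block of the first quadrant of the first input data matrix.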

Converted output data matrix 216 is a 3×9 matrix, and includes flattened versions of output data matrices 206 ¹, 206 ² and 206 ³, i.e., converted output data matrices 216 ¹, 216 ² and 216 ³. Converted output data matrix 216 may also be arranged into three quadrants; in this case, each quadrant spans all three converted output data matrices 216 ¹, 216 ² and 216 ³. The first quadrant spans the first three columns of converted output data matrix 216, the second quadrant spans the next three columns of converted output data matrix 216, and the third quadrant spans the last three columns of converted output data matrix 216. The first quadrant for each converted output data matrix 216 ¹, 216 ² and 216 ³ (i.e., o¹ _(q1), o² _(q1) and o³ _(q1)) is labeled; the remaining quadrants are not labeled for clarity.

For converted output data matrix 216 ¹, the first quadrant o¹ _(q1) includes o¹ ₁, o¹ ₂ and o¹ ₃, the second quadrant o¹ _(q2) includes o¹ ₄, o¹ ₅ and o¹ ₆, and the third quadrant o¹ _(q3) includes o¹ ₇, o¹ ₈ and o¹ ₉. For converted output data matrix 216 ², the first quadrant o² _(q1) includes o² ₁, o² ₂ and o² ₃, the second quadrant o² _(q2) includes o² ₄, o² ₅ and o² ₆, and the third quadrant o² _(q3) includes o² ₇, o² ₈ and o² ₉. For converted output data matrix 216 ³, the first quadrant o³ _(q1) includes o³ ₁, o³ ₂ and o³ ₃, the second quadrant o³ _(q2) includes o³ ₄, o³ ₅ and o³ ₆, and the third quadrant o³ _(q3) includes o³ ₇, o³ ₈ and o³ ₉.
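By way of illustration only, the grouping of each row of the converted output data matrix into three quadrants of three elements may be sketched in Python as follows. The function name, the string labels (e.g., "o1_4") and the treatment of each quadrant as one row of a reformed 3×3 output data matrix are assumptions made for the purposes of the sketch.

```python
def reform_output(converted_output, quadrant_size=3):
    """Split each 9-element row into consecutive quadrants of three elements."""
    return [
        [row[i:i + quadrant_size] for i in range(0, len(row), quadrant_size)]
        for row in converted_output
    ]


# Labelled 3x9 converted output data matrix: one row per output data matrix.
labelled_output = [[f"o{k}_{n}" for n in range(1, 10)] for k in range(1, 4)]
output_tensor = reform_output(labelled_output)
for output_matrix in output_tensor:
    print(output_matrix)
```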

The calculation of the output elements in quadrant o¹ _(q1) of converted output data matrix 216 ¹ follows.

Output element o¹ ₁ is the dot product of the first row of converted weight matrix 212, i.e., converted weight tensor 212 ¹, and the first column of converted input data matrix 214.

More particularly, output element o¹ ₁ is equal to w¹ ₁·a¹ ₁+w¹ ₂·a¹ ₂+w¹ ₃·a¹ ₃+w¹ ₄·a¹ ₄+w¹ ₅·a¹ ₇+w¹ ₆·a¹ ₈+w¹ ₇·a¹ ₉+w¹ ₈·a¹ ₁₀+w¹ ₉·a¹ ₁₃+w¹ ₁₀·a¹ ₁₄+w¹ ₁₁·a¹ ₁₅+w¹ ₁₂·a¹ ₁₆+w¹ ₁₃·a¹ ₁₉+w¹ ₁₄·a¹ ₂₀+w¹ ₁₅·a¹ ₂₁+w¹ ₁₆·a¹ ₂₂+w² ₁·a² ₁+w² ₂·a² ₂+w² ₃·a² ₃+w² ₄·a² ₄+w² ₅·a² ₇+w² ₆·a² ₈+w² ₇·a² ₉+w² ₈·a² ₁₀+w² ₉·a² ₁₃+w² ₁₀·a² ₁₄+w² ₁₁·a² ₁₅+w² ₁₂·a² ₁₆+w² ₁₃·a² ₁₉+w² ₁₄·a² ₂₀+w² ₁₅·a² ₂₁+w² ₁₆·a² ₂₂+w³ ₁·a³ ₁+w³ ₂·a³ ₂+w³ ₃·a³ ₃+w³ ₄·a³ ₄+w³ ₅·a³ ₇+w³ ₆·a³ ₈+w³ ₇·a³ ₉+w³ ₈·a³ ₁₀+w³ ₉·a³ ₁₃+w³ ₁₀·a³ ₁₄+w³ ₁₁·a³ ₁₅+w³ ₁₂·a³ ₁₆+w³ ₁₃·a³ ₁₉+w³ ₁₄·a³ ₂₀+w³ ₁₅·a³ ₂₁+w³ ₁₆·a³ ₂₂+w⁴ ₁·a⁴ ₁+w⁴ ₂·a⁴ ₂+w⁴ ₃·a⁴ ₃+w⁴ ₄·a⁴ ₄+w⁴ ₅·a⁴ ₇+w⁴ ₆·a⁴ ₈+w⁴ ₇·a⁴ ₉+w⁴ ₈·a⁴ ₁₀+w⁴ ₉·a⁴ ₁₃+w⁴ ₁₀·a⁴ ₁₄+w⁴ ₁₁·a⁴ ₁₅+w⁴ ₁₂·a⁴ ₁₆+w⁴ ₁₃·a⁴ ₁₉+w⁴ ₁₄·a⁴ ₂₀+w⁴ ₁₅·a⁴ ₂₁+w⁴ ₁₆·a⁴ ₂₂+w⁵ ₁·a⁵ ₁+w⁵ ₂·a⁵ ₂+w⁵ ₃·a⁵ ₃+w⁵ ₄·a⁵ ₄+w⁵ ₅·a⁵ ₇+w⁵ ₆·a⁵ ₈+w⁵ ₇·a⁵ ₉+w⁵ ₈·a⁵ ₁₀+w⁵ ₉·a⁵ ₁₃+w⁵ ₁₀·a⁵ ₁₄+w⁵ ₁₁·a⁵ ₁₅+w⁵ ₁₂·a⁵ ₁₆+w⁵ ₁₃·a⁵ ₁₉+w⁵ ₁₄·a⁵ ₂₀+w⁵ ₁₅·a⁵ ₂₁+w⁵ ₁₆·a⁵ ₂₂+w⁶ ₁·a⁶ ₁+w⁶ ₂·a⁶ ₂+w⁶ ₃·a⁶ ₃+w⁶ ₄·a⁶ ₄+w⁶ ₅·a⁶ ₇+w⁶ ₆·a⁶ ₈+w⁶ ₇·a⁶ ₉+w⁶ ₈·a⁶ ₁₀+w⁶ ₉·a⁶ ₁₃+w⁶ ₁₀·a⁶ ₁₄+w⁶ ₁₁·a⁶ ₁₅+w⁶ ₁₂·a⁶ ₁₆+w⁶ ₁₃·a⁶ ₁₉+w⁶ ₁₄·a⁶ ₂₀+w⁶ ₁₅·a⁶ ₂₁+w⁶ ₁₆·a⁶ ₂₂+w⁷ ₁·a⁷ ₁+w⁷ ₂·a⁷ ₂+w⁷ ₃·a⁷ ₃+w⁷ ₄·a⁷ ₄+w⁷ ₅·a⁷ ₇+w⁷ ₆·a⁷ ₈+w⁷ ₇·a⁷ ₉+w⁷ ₈·a⁷ ₁₀+w⁷ ₉·a⁷ ₁₃+w⁷ ₁₀·a⁷ ₁₄+w⁷ ₁₁·a⁷ ₁₅+w⁷ ₁₂·a⁷ ₁₆+w⁷ ₁₃·a⁷ ₁₉+w⁷ ₁₄·a⁷ ₂₀+w⁷ ₁₅·a⁷ ₂₁+w⁷ ₁₆·a⁷ ₂₂+w⁸ ₁·a⁸ ₁+w⁸ ₂·a⁸ ₂+w⁸ ₃·a⁸ ₃+w⁸ ₄·a⁸ ₄+w⁸ ₅·a⁸ ₇+w⁸ ₆·a⁸ ₈+w⁸ ₇·a⁸ ₉+w⁸ ₈·a⁸ ₁₀+w⁸ ₉·a⁸ ₁₃+w⁸ ₁₀·a⁸ ₁₄+w⁸ ₁₁·a⁸ ₁₅+w⁸ ₁₂·a⁸ ₁₆+w⁸ ₁₃·a⁸ ₁₉+w⁸ ₁₄·a⁸ ₂₀+w⁸ ₁₅·a⁸ ₂₁+w⁸ ₁₆·a⁸ ₂₂.

As shown above, output element o¹ ₁ of converted output data matrix 216 is equal to output element o¹ ₁ of output data matrix 206 ¹.

Output element o¹ ₂ is the dot product of the first row of converted weight matrix 212, i.e., converted weight tensor 212 ¹, and the second column of converted input data matrix 214.

More particularly, output element o¹ ₂ is equal to w¹ ₁·a¹ ₂+w¹ ₂·a¹ ₃+w¹ ₃·a¹ ₄+w¹ ₄·a¹ ₅+w¹ ₅·a¹ ₈+w¹ ₆·a¹ ₉+w¹ ₇·a¹ ₁₀+w¹ ₈·a¹ ₁₁+w¹ ₉·a¹ ₁₄+w¹ ₁₀·a¹ ₁₅+w¹ ₁₁·a¹ ₁₆+w¹ ₁₂·a¹ ₁₇+w¹ ₁₃·a¹ ₂₀+w¹ ₁₄·a¹ ₂₁+w¹ ₁₅·a¹ ₂₂+w¹ ₁₆·a¹ ₂₃+w² ₁·a² ₂+w² ₂·a² ₃+w² ₃·a² ₄+w² ₄·a² ₅+w² ₅·a² ₈+w² ₆·a² ₉+w² ₇·a² ₁₀+w² ₈·a² ₁₁+w² ₉·a² ₁₄+w² ₁₀·a² ₁₅+w² ₁₁·a² ₁₆+w² ₁₂·a² ₁₇+w² ₁₃·a² ₂₀+w² ₁₄·a² ₂₁+w² ₁₅·a² ₂₂+w² ₁₆·a² ₂₃+w³ ₁·a³ ₂+w³ ₂·a³ ₃+w³ ₃·a³ ₄+w³ ₄·a³ ₅+w³ ₅·a³ ₈+w³ ₆·a³ ₉+w³ ₇·a³ ₁₀+w³ ₈·a³ ₁₁+w³ ₉·a³ ₁₄+w³ ₁₀·a³ ₁₅+w³ ₁₁·a³ ₁₆+w³ ₁₂·a³ ₁₇+w³ ₁₃·a³ ₂₀+w³ ₁₄·a³ ₂₁+w³ ₁₅·a³ ₂₂+w³ ₁₆·a³ ₂₃+w⁴ ₁·a⁴ ₂+w⁴ ₂·a⁴ ₃+w⁴ ₃·a⁴ ₄+w⁴ ₄·a⁴ ₅+w⁴ ₅·a⁴ ₈+w⁴ ₆·a⁴ ₉+w⁴ ₇·a⁴ ₁₀+w⁴ ₈·a⁴ ₁₁+w⁴ ₉·a⁴ ₁₄+w⁴ ₁₀·a⁴ ₁₅+w⁴ ₁₁·a⁴ ₁₆+w⁴ ₁₂·a⁴ ₁₇+w⁴ ₁₃·a⁴ ₂₀+w⁴ ₁₄·a⁴ ₂₁+w⁴ ₁₅·a⁴ ₂₂+w⁴ ₁₆·a⁴ ₂₃+w⁵ ₁·a⁵ ₂+w⁵ ₂·a⁵ ₃+w⁵ ₃·a⁵ ₄+w⁵ ₄·a⁵ ₅+w⁵ ₅·a⁵ ₈+w⁵ ₆·a⁵ ₉+w⁵ ₇·a⁵ ₁₀+w⁵ ₈·a⁵ ₁₁+w⁵ ₉·a⁵ ₁₄+w⁵ ₁₀·a⁵ ₁₅+w⁵ ₁₁·a⁵ ₁₆+w⁵ ₁₂·a⁵ ₁₇+w⁵ ₁₃·a⁵ ₂₀+w⁵ ₁₄·a⁵ ₂₁+w⁵ ₁₅·a⁵ ₂₂+w⁵ ₁₆·a⁵ ₂₃+w⁶ ₁·a⁶ ₂+w⁶ ₂·a⁶ ₃+w⁶ ₃·a⁶ ₄+w⁶ ₄·a⁶ ₅+w⁶ ₅·a⁶ ₈+w⁶ ₆·a⁶ ₉+w⁶ ₇·a⁶ ₁₀+w⁶ ₈·a⁶ ₁₁+w⁶ ₉·a⁶ ₁₄+w⁶ ₁₀·a⁶ ₁₅+w⁶ ₁₁·a⁶ ₁₆+w⁶ ₁₂·a⁶ ₁₇+w⁶ ₁₃·a⁶ ₂₀+w⁶ ₁₄·a⁶ ₂₁+w⁶ ₁₅·a⁶ ₂₂+w⁶ ₁₆·a⁶ ₂₃+w⁷ ₁·a⁷ ₂+w⁷ ₂·a⁷ ₃+w⁷ ₃·a⁷ ₄+w⁷ ₄·a⁷ ₅+w⁷ ₅·a⁷ ₈+w⁷ ₆·a⁷ ₉+w⁷ ₇·a⁷ ₁₀+w⁷ ₈·a⁷ ₁₁+w⁷ ₉·a⁷ ₁₄+w⁷ ₁₀·a⁷ ₁₅+w⁷ ₁₁·a⁷ ₁₆+w⁷ ₁₂·a⁷ ₁₇+w⁷ ₁₃·a⁷ ₂₀+w⁷ ₁₄·a⁷ ₂₁+w⁷ ₁₅·a⁷ ₂₂+w⁷ ₁₆·a⁷ ₂₃+w⁸ ₁·a⁸ ₂+w⁸ ₂·a⁸ ₃+w⁸ ₃·a⁸ ₄+w⁸ ₄·a⁸ ₅+w⁸ ₅·a⁸ ₈+w⁸ ₆·a⁸ ₉+w⁸ ₇·a⁸ ₁₀+w⁸ ₈·a⁸ ₁₁+w⁸ ₉·a⁸ ₁₄+w⁸ ₁₀·a⁸ ₁₅+w⁸ ₁₁·a⁸ ₁₆+w⁸ ₁₂·a⁸ ₁₇+w⁸ ₁₃·a⁸ ₂₀+w⁸ ₁₄·a⁸ ₂₁+w⁸ ₁₅·a⁸ ₂₂+w⁸ ₁₆·a⁸ ₂₃.

As shown above, output element o¹ ₂ of converted output data matrix 216 is equal to output element o¹ ₂ of output data matrix 206 ¹.

Output element o¹ ₃ is the dot product of the first row of converted weight matrix 212, i.e., converted weight tensor 212 ¹, and the third column of converted input data matrix 214.

More particularly, output element o¹ ₃ is equal to w¹ ₁·a¹ ₃+w¹ ₂·a¹ ₄+w¹ ₃·a¹ ₅+w¹ ₄·a¹ ₆+w¹ ₅·a¹ ₉+w¹ ₆·a¹ ₁₀+w¹ ₇·a¹ ₁₁+w¹ ₈·a¹ ₁₂+w¹ ₉·a¹ ₁₅+w¹ ₁₀·a¹ ₁₆+w¹ ₁₁·a¹ ₁₇+w¹ ₁₂·a¹ ₁₈+w¹ ₁₃·a¹ ₂₁+w¹ ₁₄·a¹ ₂₂+w¹ ₁₅·a¹ ₂₃+w¹ ₁₆·a¹ ₂₄+w² ₁·a² ₃+w² ₂·a² ₄+w² ₃·a² ₅+w² ₄·a² ₆+w² ₅·a² ₉+w² ₆·a² ₁₀+w² ₇·a² ₁₁+w² ₈·a² ₁₂+w² ₉·a² ₁₅+w² ₁₀·a² ₁₆+w² ₁₁·a² ₁₇+w² ₁₂·a² ₁₈+w² ₁₃·a² ₂₁+w² ₁₄·a² ₂₂+w² ₁₅·a² ₂₃+w² ₁₆·a² ₂₄+w³ ₁·a³ ₃+w³ ₂·a³ ₄+w³ ₃·a³ ₅+w³ ₄·a³ ₆+w³ ₅·a³ ₉+w³ ₆·a³ ₁₀+w³ ₇·a³ ₁₁+w³ ₈·a³ ₁₂+w³ ₉·a³ ₁₅+w³ ₁₀·a³ ₁₆+w³ ₁₁·a³ ₁₇+w³ ₁₂·a³ ₁₈+w³ ₁₃·a³ ₂₁+w³ ₁₄·a³ ₂₂+w³ ₁₅·a³ ₂₃+w³ ₁₆·a³ ₂₄+w⁴ ₁·a⁴ ₃+w⁴ ₂·a⁴ ₄+w⁴ ₃·a⁴ ₅+w⁴ ₄·a⁴ ₆+w⁴ ₅·a⁴ ₉+w⁴ ₆·a⁴ ₁₀+w⁴ ₇·a⁴ ₁₁+w⁴ ₈·a⁴ ₁₂+w⁴ ₉·a⁴ ₁₅+w⁴ ₁₀·a⁴ ₁₆+w⁴ ₁₁·a⁴ ₁₇+w⁴ ₁₂·a⁴ ₁₈+w⁴ ₁₃·a⁴ ₂₁+w⁴ ₁₄·a⁴ ₂₂+w⁴ ₁₅·a⁴ ₂₃+w⁴ ₁₆·a⁴ ₂₄+w⁵ ₁·a⁵ ₃+w⁵ ₂·a⁵ ₄+w⁵ ₃·a⁵ ₅+w⁵ ₄·a⁵ ₆+w⁵ ₅·a⁵ ₉+w⁵ ₆·a⁵ ₁₀+w⁵ ₇·a⁵ ₁₁+w⁵ ₈·a⁵ ₁₂+w⁵ ₉·a⁵ ₁₅+w⁵ ₁₀·a⁵ ₁₆+w⁵ ₁₁·a⁵ ₁₇+w⁵ ₁₂·a⁵ ₁₈+w⁵ ₁₃·a⁵ ₂₁+w⁵ ₁₄·a⁵ ₂₂+w⁵ ₁₅·a⁵ ₂₃+w⁵ ₁₆·a⁵ ₂₄+w⁶ ₁·a⁶ ₃+w⁶ ₂·a⁶ ₄+w⁶ ₃·a⁶ ₅+w⁶ ₄·a⁶ ₆+w⁶ ₅·a⁶ ₉+w⁶ ₆·a⁶ ₁₀+w⁶ ₇·a⁶ ₁₁+w⁶ ₈·a⁶ ₁₂+w⁶ ₉·a⁶ ₁₅+w⁶ ₁₀·a⁶ ₁₆+w⁶ ₁₁·a⁶ ₁₇+w⁶ ₁₂·a⁶ ₁₈+w⁶ ₁₃·a⁶ ₂₁+w⁶ ₁₄·a⁶ ₂₂+w⁶ ₁₅·a⁶ ₂₃+w⁶ ₁₆·a⁶ ₂₄+w⁷ ₁·a⁷ ₃+w⁷ ₂·a⁷ ₄+w⁷ ₃·a⁷ ₅+w⁷ ₄·a⁷ ₆+w⁷ ₅·a⁷ ₉+w⁷ ₆·a⁷ ₁₀+w⁷ ₇·a⁷ ₁₁+w⁷ ₈·a⁷ ₁₂+w⁷ ₉·a⁷ ₁₅+w⁷ ₁₀·a⁷ ₁₆+w⁷ ₁₁·a⁷ ₁₇+w⁷ ₁₂·a⁷ ₁₈+w⁷ ₁₃·a⁷ ₂₁+w⁷ ₁₄·a⁷ ₂₂+w⁷ ₁₅·a⁷ ₂₃+w⁷ ₁₆·a⁷ ₂₄+w⁸ ₁·a⁸ ₃+w⁸ ₂·a⁸ ₄+w⁸ ₃·a⁸ ₅+w⁸ ₄·a⁸ ₆+w⁸ ₅·a⁸ ₉+w⁸ ₆·a⁸ ₁₀+w⁸ ₇·a⁸ ₁₁+w⁸ ₈·a⁸ ₁₂+w⁸ ₉·a⁸ ₁₅+w⁸ ₁₀·a⁸ ₁₆+w⁸ ₁₁·a⁸ ₁₇+w⁸ ₁₂·a⁸ ₁₈+w⁸ ₁₃·a⁸ ₂₁+w⁸ ₁₄·a⁸ ₂₂+w⁸ ₁₅·a⁸ ₂₃+w⁸ ₁₆·a⁸ ₂₄.

As shown above, output element o¹ ₃ of converted output data matrix 216 is equal to output element o¹ ₃ of output data matrix 206 ¹.

The calculation of the output elements in quadrant o² _(q1) of converted output data matrix 216 ² follows.

Output element o² ₁ is the dot product of the second row of converted weight matrix 212, i.e., converted weight tensor 212 ², and the first column of converted input data matrix 214.

More particularly, output element o² ₁ is equal to x¹ ₁·a¹ ₁+x¹ ₂·a¹ ₂+x¹ ₃·a¹ ₃+x¹ ₄·a¹ ₄+x¹ ₅·a¹ ₇+x¹ ₆·a¹ ₈+x¹ ₇·a¹ ₉+x¹ ₈·a¹ ₁₀+x¹ ₉·a¹ ₁₃+x¹ ₁₀·a¹ ₁₄+x¹ ₁₁·a¹ ₁₅+x¹ ₁₂·a¹ ₁₆+x¹ ₁₃·a¹ ₁₉+x¹ ₁₄·a¹ ₂₀+x¹ ₁₅·a¹ ₂₁+x¹ ₁₆·a¹ ₂₂+x² ₁·a² ₁+x² ₂·a² ₂+x² ₃·a² ₃+x² ₄·a² ₄+x² ₅·a² ₇+x² ₆·a² ₈+x² ₇·a² ₉+x² ₈·a² ₁₀+x² ₉·a² ₁₃+x² ₁₀·a² ₁₄+x² ₁₁·a² ₁₅+x² ₁₂·a² ₁₆+x² ₁₃·a² ₁₉+x² ₁₄·a² ₂₀+x² ₁₅·a² ₂₁+x² ₁₆·a² ₂₂+x³ ₁·a³ ₁+x³ ₂·a³ ₂+x³ ₃·a³ ₃+x³ ₄·a³ ₄+x³ ₅·a³ ₇+x³ ₆·a³ ₈+x³ ₇·a³ ₉+x³ ₈·a³ ₁₀+x³ ₉·a³ ₁₃+x³ ₁₀·a³ ₁₄+x³ ₁₁·a³ ₁₅+x³ ₁₂·a³ ₁₆+x³ ₁₃·a³ ₁₉+x³ ₁₄·a³ ₂₀+x³ ₁₅·a³ ₂₁+x³ ₁₆·a³ ₂₂+x⁴ ₁·a⁴ ₁+x⁴ ₂·a⁴ ₂+x⁴ ₃·a⁴ ₃+x⁴ ₄·a⁴ ₄+x⁴ ₅·a⁴ ₇+x⁴ ₆·a⁴ ₈+x⁴ ₇·a⁴ ₉+x⁴ ₈·a⁴ ₁₀+x⁴ ₉·a⁴ ₁₃+x⁴ ₁₀·a⁴ ₁₄+x⁴ ₁₁·a⁴ ₁₅+x⁴ ₁₂·a⁴ ₁₆+x⁴ ₁₃·a⁴ ₁₉+x⁴ ₁₄·a⁴ ₂₀+x⁴ ₁₅·a⁴ ₂₁+x⁴ ₁₆·a⁴ ₂₂+x⁵ ₁·a⁵ ₁+x⁵ ₂·a⁵ ₂+x⁵ ₃·a⁵ ₃+x⁵ ₄·a⁵ ₄+x⁵ ₅·a⁵ ₇+x⁵ ₆·a⁵ ₈+x⁵ ₇·a⁵ ₉+x⁵ ₈·a⁵ ₁₀+x⁵ ₉·a⁵ ₁₃+x⁵ ₁₀·a⁵ ₁₄+x⁵ ₁₁·a⁵ ₁₅+x⁵ ₁₂·a⁵ ₁₆+x⁵ ₁₃·a⁵ ₁₉+x⁵ ₁₄·a⁵ ₂₀+x⁵ ₁₅·a⁵ ₂₁+x⁵ ₁₆·a⁵ ₂₂+x⁶ ₁·a⁶ ₁+x⁶ ₂·a⁶ ₂+x⁶ ₃·a⁶ ₃+x⁶ ₄·a⁶ ₄+x⁶ ₅·a⁶ ₇+x⁶ ₆·a⁶ ₈+x⁶ ₇·a⁶ ₉+x⁶ ₈·a⁶ ₁₀+x⁶ ₉·a⁶ ₁₃+x⁶ ₁₀·a⁶ ₁₄+x⁶ ₁₁·a⁶ ₁₅+x⁶ ₁₂·a⁶ ₁₆+x⁶ ₁₃·a⁶ ₁₉+x⁶ ₁₄·a⁶ ₂₀+x⁶ ₁₅·a⁶ ₂₁+x⁶ ₁₆·a⁶ ₂₂+x⁷ ₁·a⁷ ₁+x⁷ ₂·a⁷ ₂+x⁷ ₃·a⁷ ₃+x⁷ ₄·a⁷ ₄+x⁷ ₅·a⁷ ₇+x⁷ ₆·a⁷ ₈+x⁷ ₇·a⁷ ₉+x⁷ ₈·a⁷ ₁₀+x⁷ ₉·a⁷ ₁₃+x⁷ ₁₀·a⁷ ₁₄+x⁷ ₁₁·a⁷ ₁₅+x⁷ ₁₂·a⁷ ₁₆+x⁷ ₁₃·a⁷ ₁₉+x⁷ ₁₄·a⁷ ₂₀+x⁷ ₁₅·a⁷ ₂₁+x⁷ ₁₆·a⁷ ₂₂+x⁸ ₁·a⁸ ₁+x⁸ ₂·a⁸ ₂+x⁸ ₃·a⁸ ₃+x⁸ ₄·a⁸ ₄+x⁸ ₅·a⁸ ₇+x⁸ ₆·a⁸ ₈+x⁸ ₇·a⁸ ₉+x⁸ ₈·a⁸ ₁₀+x⁸ ₉·a⁸ ₁₃+x⁸ ₁₀·a⁸ ₁₄+x⁸ ₁₁·a⁸ ₁₅+x⁸ ₁₂·a⁸ ₁₆+x⁸ ₁₃·a⁸ ₁₉+x⁸ ₁₄·a⁸ ₂₀+x⁸ ₁₅·a⁸ ₂₁+x⁸ ₁₆·a⁸ ₂₂.

As shown above, output element o² ₁ of converted output data matrix 216 is equal to output element o² ₁ of output data matrix 206 ².

Output element o² ₂ is the dot product of the second row of converted weight matrix 212, i.e., converted weight tensor 212 ², and the second column of converted input data matrix 214.

More particularly, output element o² ₂ is equal to x¹ ₁·a¹ ₂+x¹ ₂·a¹ ₃+x¹ ₃·a¹ ₄+x¹ ₄·a¹ ₅+x¹ ₅·a¹ ₈+x¹ ₆·a¹ ₉+x¹ ₇·a¹ ₁₀+x¹ ₈·a¹ ₁₁+x¹ ₉·a¹ ₁₄+x¹ ₁₀·a¹ ₁₅+x¹ ₁₁·a¹ ₁₆+x¹ ₁₂·a¹ ₁₇+x¹ ₁₃·a¹ ₂₀+x¹ ₁₄·a¹ ₂₁+x¹ ₁₅·a¹ ₂₂+x¹ ₁₆·a¹ ₂₃+x² ₁·a² ₂+x² ₂·a² ₃+x² ₃·a² ₄+x² ₄·a² ₅+x² ₅·a² ₈+x² ₆·a² ₉+x² ₇·a² ₁₀+x² ₈·a² ₁₁+x² ₉·a² ₁₄+x² ₁₀·a² ₁₅+x² ₁₁·a² ₁₆+x² ₁₂·a² ₁₇+x² ₁₃·a² ₂₀+x² ₁₄·a² ₂₁+x² ₁₅·a² ₂₂+x² ₁₆·a² ₂₃+x³ ₁·a³ ₂+x³ ₂·a³ ₃+x³ ₃·a³ ₄+x³ ₄·a³ ₅+x³ ₅·a³ ₈+x³ ₆·a³ ₉+x³ ₇·a³ ₁₀+x³ ₈·a³ ₁₁+x³ ₉·a³ ₁₄+x³ ₁₀·a³ ₁₅+x³ ₁₁·a³ ₁₆+x³ ₁₂·a³ ₁₇+x³ ₁₃·a³ ₂₀+x³ ₁₄·a³ ₂₁+x³ ₁₅·a³ ₂₂+x³ ₁₆·a³ ₂₃+x⁴ ₁·a⁴ ₂+x⁴ ₂·a⁴ ₃+x⁴ ₃·a⁴ ₄+x⁴ ₄·a⁴ ₅+x⁴ ₅·a⁴ ₈+x⁴ ₆·a⁴ ₉+x⁴ ₇·a⁴ ₁₀+x⁴ ₈·a⁴ ₁₁+x⁴ ₉·a⁴ ₁₄+x⁴ ₁₀·a⁴ ₁₅+x⁴ ₁₁·a⁴ ₁₆+x⁴ ₁₂·a⁴ ₁₇+x⁴ ₁₃·a⁴ ₂₀+x⁴ ₁₄·a⁴ ₂₁+x⁴ ₁₅·a⁴ ₂₂+x⁴ ₁₆·a⁴ ₂₃+x⁵ ₁·a⁵ ₂+x⁵ ₂·a⁵ ₃+x⁵ ₃·a⁵ ₄+x⁵ ₄·a⁵ ₅+x⁵ ₅·a⁵ ₈+x⁵ ₆·a⁵ ₉+x⁵ ₇·a⁵ ₁₀+x⁵ ₈·a⁵ ₁₁+x⁵ ₉·a⁵ ₁₄+x⁵ ₁₀·a⁵ ₁₅+x⁵ ₁₁·a⁵ ₁₆+x⁵ ₁₂·a⁵ ₁₇+x⁵ ₁₃·a⁵ ₂₀+x⁵ ₁₄·a⁵ ₂₁+x⁵ ₁₅·a⁵ ₂₂+x⁵ ₁₆·a⁵ ₂₃+x⁶ ₁·a⁶ ₂+x⁶ ₂·a⁶ ₃+x⁶ ₃·a⁶ ₄+x⁶ ₄·a⁶ ₅+x⁶ ₅·a⁶ ₈+x⁶ ₆·a⁶ ₉+x⁶ ₇·a⁶ ₁₀+x⁶ ₈·a⁶ ₁₁+x⁶ ₉·a⁶ ₁₄+x⁶ ₁₀·a⁶ ₁₅+x⁶ ₁₁·a⁶ ₁₆+x⁶ ₁₂·a⁶ ₁₇+x⁶ ₁₃·a⁶ ₂₀+x⁶ ₁₄·a⁶ ₂₁+x⁶ ₁₅·a⁶ ₂₂+x⁶ ₁₆·a⁶ ₂₃+x⁷ ₁·a⁷ ₂+x⁷ ₂·a⁷ ₃+x⁷ ₃·a⁷ ₄+x⁷ ₄·a⁷ ₅+x⁷ ₅·a⁷ ₈+x⁷ ₆·a⁷ ₉+x⁷ ₇·a⁷ ₁₀+x⁷ ₈·a⁷ ₁₁+x⁷ ₉·a⁷ ₁₄+x⁷ ₁₀·a⁷ ₁₅+x⁷ ₁₁·a⁷ ₁₆+x⁷ ₁₂·a⁷ ₁₇+x⁷ ₁₃·a⁷ ₂₀+x⁷ ₁₄·a⁷ ₂₁+x⁷ ₁₅·a⁷ ₂₂+x⁷ ₁₆·a⁷ ₂₃+x⁸ ₁·a⁸ ₂+x⁸ ₂·a⁸ ₃+x⁸ ₃·a⁸ ₄+x⁸ ₄·a⁸ ₅+x⁸ ₅·a⁸ ₈+x⁸ ₆·a⁸ ₉+x⁸ ₇·a⁸ ₁₀+x⁸ ₈·a⁸ ₁₁+x⁸ ₉·a⁸ ₁₄+x⁸ ₁₀·a⁸ ₁₅+x⁸ ₁₁·a⁸ ₁₆+x⁸ ₁₂·a⁸ ₁₇+x⁸ ₁₃·a⁸ ₂₀+x⁸ ₁₄·a⁸ ₂₁+x⁸ ₁₅·a⁸ ₂₂+x⁸ ₁₆·a⁸ ₂₃.

As shown above, output element o² ₂ of converted output data matrix 216 is equal to output element o² ₂ of output data matrix 206 ².

Output element o² ₃ is the dot product of the second row of converted weight matrix 212, i.e., converted weight tensor 212 ², and the third column of converted input data matrix 214.

More particularly, output element o² ₃ is equal to x¹ ₁·a¹ ₃+x¹ ₂·a¹ ₄+x¹ ₃·a¹ ₅+x¹ ₄·a¹ ₆+x¹ ₅·a¹ ₉+x¹ ₆·a¹ ₁₀+x¹ ₇·a¹ ₁₁+x¹ ₈·a¹ ₁₂+x¹ ₉·a¹ ₁₅+x¹ ₁₀·a¹ ₁₆+x¹ ₁₁·a¹ ₁₇+x¹ ₁₂·a¹ ₁₈+x¹ ₁₃·a¹ ₂₁+x¹ ₁₄·a¹ ₂₂+x¹ ₁₅·a¹ ₂₃+x¹ ₁₆·a¹ ₂₄+x² ₁·a² ₃+x² ₂·a² ₄+x² ₃·a² ₅+x² ₄·a² ₆+x² ₅·a² ₉+x² ₆·a² ₁₀+x² ₇·a² ₁₁+x² ₈·a² ₁₂+x² ₉·a² ₁₅+x² ₁₀·a² ₁₆+x² ₁₁·a² ₁₇+x² ₁₂·a² ₁₈+x² ₁₃·a² ₂₁+x² ₁₄·a² ₂₂+x² ₁₅·a² ₂₃+x² ₁₆·a² ₂₄+x³ ₁·a³ ₃+x³ ₂·a³ ₄+x³ ₃·a³ ₅+x³ ₄·a³ ₆+x³ ₅·a³ ₉+x³ ₆·a³ ₁₀+x³ ₇·a³ ₁₁+x³ ₈·a³ ₁₂+x³ ₉·a³ ₁₅+x³ ₁₀·a³ ₁₆+x³ ₁₁·a³ ₁₇+x³ ₁₂·a³ ₁₈+x³ ₁₃·a³ ₂₁+x³ ₁₄·a³ ₂₂+x³ ₁₅·a³ ₂₃+x³ ₁₆·a³ ₂₄+x⁴ ₁·a⁴ ₃+x⁴ ₂·a⁴ ₄+x⁴ ₃·a⁴ ₅+x⁴ ₄·a⁴ ₆+x⁴ ₅·a⁴ ₉+x⁴ ₆·a⁴ ₁₀+x⁴ ₇·a⁴ ₁₁+x⁴ ₈·a⁴ ₁₂+x⁴ ₉·a⁴ ₁₅+x⁴ ₁₀·a⁴ ₁₆+x⁴ ₁₁·a⁴ ₁₇+x⁴ ₁₂·a⁴ ₁₈+x⁴ ₁₃·a⁴ ₂₁+x⁴ ₁₄·a⁴ ₂₂+x⁴ ₁₅·a⁴ ₂₃+x⁴ ₁₆·a⁴ ₂₄+x⁵ ₁·a⁵ ₃+x⁵ ₂·a⁵ ₄+x⁵ ₃·a⁵ ₅+x⁵ ₄·a⁵ ₆+x⁵ ₅·a⁵ ₉+x⁵ ₆·a⁵ ₁₀+x⁵ ₇·a⁵ ₁₁+x⁵ ₈·a⁵ ₁₂+x⁵ ₉·a⁵ ₁₅+x⁵ ₁₀·a⁵ ₁₆+x⁵ ₁₁·a⁵ ₁₇+x⁵ ₁₂·a⁵ ₁₈+x⁵ ₁₃·a⁵ ₂₁+x⁵ ₁₄·a⁵ ₂₂+x⁵ ₁₅·a⁵ ₂₃+x⁵ ₁₆·a⁵ ₂₄+x⁶ ₁·a⁶ ₃+x⁶ ₂·a⁶ ₄+x⁶ ₃·a⁶ ₅+x⁶ ₄·a⁶ ₆+x⁶ ₅·a⁶ ₉+x⁶ ₆·a⁶ ₁₀+x⁶ ₇·a⁶ ₁₁+x⁶ ₈·a⁶ ₁₂+x⁶ ₉·a⁶ ₁₅+x⁶ ₁₀·a⁶ ₁₆+x⁶ ₁₁·a⁶ ₁₇+x⁶ ₁₂·a⁶ ₁₈+x⁶ ₁₃·a⁶ ₂₁+x⁶ ₁₄·a⁶ ₂₂+x⁶ ₁₅·a⁶ ₂₃+x⁶ ₁₆·a⁶ ₂₄+x⁷ ₁·a⁷ ₃+x⁷ ₂·a⁷ ₄+x⁷ ₃·a⁷ ₅+x⁷ ₄·a⁷ ₆+x⁷ ₅·a⁷ ₉+x⁷ ₆·a⁷ ₁₀+x⁷ ₇·a⁷ ₁₁+x⁷ ₈·a⁷ ₁₂+x⁷ ₉·a⁷ ₁₅+x⁷ ₁₀·a⁷ ₁₆+x⁷ ₁₁·a⁷ ₁₇+x⁷ ₁₂·a⁷ ₁₈+x⁷ ₁₃·a⁷ ₂₁+x⁷ ₁₄·a⁷ ₂₂+x⁷ ₁₅·a⁷ ₂₃+x⁷ ₁₆·a⁷ ₂₄+x⁸ ₁·a⁸ ₃+x⁸ ₂·a⁸ ₄+x⁸ ₃·a⁸ ₅+x⁸ ₄·a⁸ ₆+x⁸ ₅·a⁸ ₉+x⁸ ₆·a⁸ ₁₀+x⁸ ₇·a⁸ ₁₁+x⁸ ₈·a⁸ ₁₂+x⁸ ₉·a⁸ ₁₅+x⁸ ₁₀·a⁸ ₁₆+x⁸ ₁₁·a⁸ ₁₇+x⁸ ₁₂·a⁸ ₁₈+x⁸ ₁₃·a⁸ ₂₁+x⁸ ₁₄·a⁸ ₂₂+x⁸ ₁₅·a⁸ ₂₃+x⁸ ₁₆·a⁸ ₂₄.

As shown above, output element o² ₃ of converted output data matrix 216 is equal to output element o² ₃ of output data matrix 206 ².

The calculation of the output elements in quadrant o³ _(q1) of converted output data matrix 216 ³ follows.

Output element o³ ₁ is the dot product of the third row of converted weight matrix 212, i.e., converted weight tensor 212 ³, and the first column of converted input data matrix 214.

More particularly, output element o³ ₁ is equal to y¹ ₁·a¹ ₁+y¹ ₂·a¹ ₂+y¹ ₃·a¹ ₃+y¹ ₄·a¹ ₄+y¹ ₅·a¹ ₇+y¹ ₆·a¹ ₈+y¹ ₇·a¹ ₉+y¹ ₈·a¹ ₁₀+y¹ ₉·a¹ ₁₃+y¹ ₁₀·a¹ ₁₄+y¹ ₁₁·a¹ ₁₅+y¹ ₁₂·a¹ ₁₆+y¹ ₁₃·a¹ ₁₉+y¹ ₁₄·a¹ ₂₀+y¹ ₁₅·a¹ ₂₁+y¹ ₁₆·a¹ ₂₂+y² ₁·a² ₁+y² ₂·a² ₂+y² ₃·a² ₃+y² ₄·a² ₄+y² ₅·a² ₇+y² ₆·a² ₈+y² ₇·a² ₉+y² ₈·a² ₁₀+y² ₉·a² ₁₃+y² ₁₀·a² ₁₄+y² ₁₁·a² ₁₅+y² ₁₂·a² ₁₆+y² ₁₃·a² ₁₉+y² ₁₄·a² ₂₀+y² ₁₅·a² ₂₁+y² ₁₆·a² ₂₂+y³ ₁·a³ ₁+y³ ₂·a³ ₂+y³ ₃·a³ ₃+y³ ₄·a³ ₄+y³ ₅·a³ ₇+y³ ₆·a³ ₈+y³ ₇·a³ ₉+y³ ₈·a³ ₁₀+y³ ₉·a³ ₁₃+y³ ₁₀·a³ ₁₄+y³ ₁₁·a³ ₁₅+y³ ₁₂·a³ ₁₆+y³ ₁₃·a³ ₁₉+y³ ₁₄·a³ ₂₀+y³ ₁₅·a³ ₂₁+y³ ₁₆·a³ ₂₂+y⁴ ₁·a⁴ ₁+y⁴ ₂·a⁴ ₂+y⁴ ₃·a⁴ ₃+y⁴ ₄·a⁴ ₄+y⁴ ₅·a⁴ ₇+y⁴ ₆·a⁴ ₈+y⁴ ₇·a⁴ ₉+y⁴ ₈·a⁴ ₁₀+y⁴ ₉·a⁴ ₁₃+y⁴ ₁₀·a⁴ ₁₄+y⁴ ₁₁·a⁴ ₁₅+y⁴ ₁₂·a⁴ ₁₆+y⁴ ₁₃·a⁴ ₁₉+y⁴ ₁₄·a⁴ ₂₀+y⁴ ₁₅·a⁴ ₂₁+y⁴ ₁₆·a⁴ ₂₂+y⁵ ₁·a⁵ ₁+y⁵ ₂·a⁵ ₂+y⁵ ₃·a⁵ ₃+y⁵ ₄·a⁵ ₄+y⁵ ₅·a⁵ ₇+y⁵ ₆·a⁵ ₈+y⁵ ₇·a⁵ ₉+y⁵ ₈·a⁵ ₁₀+y⁵ ₉·a⁵ ₁₃+y⁵ ₁₀·a⁵ ₁₄+y⁵ ₁₁·a⁵ ₁₅+y⁵ ₁₂·a⁵ ₁₆+y⁵ ₁₃·a⁵ ₁₉+y⁵ ₁₄·a⁵ ₂₀+y⁵ ₁₅·a⁵ ₂₁+y⁵ ₁₆·a⁵ ₂₂+y⁶ ₁·a⁶ ₁+y⁶ ₂·a⁶ ₂+y⁶ ₃·a⁶ ₃+y⁶ ₄·a⁶ ₄+y⁶ ₅·a⁶ ₇+y⁶ ₆·a⁶ ₈+y⁶ ₇·a⁶ ₉+y⁶ ₈·a⁶ ₁₀+y⁶ ₉·a⁶ ₁₃+y⁶ ₁₀·a⁶ ₁₄+y⁶ ₁₁·a⁶ ₁₅+y⁶ ₁₂·a⁶ ₁₆+y⁶ ₁₃·a⁶ ₁₉+y⁶ ₁₄·a⁶ ₂₀+y⁶ ₁₅·a⁶ ₂₁+y⁶ ₁₆·a⁶ ₂₂+y⁷ ₁·a⁷ ₁+y⁷ ₂·a⁷ ₂+y⁷ ₃·a⁷ ₃+y⁷ ₄·a⁷ ₄+y⁷ ₅·a⁷ ₇+y⁷ ₆·a⁷ ₈+y⁷ ₇·a⁷ ₉+y⁷ ₈·a⁷ ₁₀+y⁷ ₉·a⁷ ₁₃+y⁷ ₁₀·a⁷ ₁₄+y⁷ ₁₁·a⁷ ₁₅+y⁷ ₁₂·a⁷ ₁₆+y⁷ ₁₃·a⁷ ₁₉+y⁷ ₁₄·a⁷ ₂₀+y⁷ ₁₅·a⁷ ₂₁+y⁷ ₁₆·a⁷ ₂₂+y⁸ ₁·a⁸ ₁+y⁸ ₂·a⁸ ₂+y⁸ ₃·a⁸ ₃+y⁸ ₄·a⁸ ₄+y⁸ ₅·a⁸ ₇+y⁸ ₆·a⁸ ₈+y⁸ ₇·a⁸ ₉+y⁸ ₈·a⁸ ₁₀+y⁸ ₉·a⁸ ₁₃+y⁸ ₁₀·a⁸ ₁₄+y⁸ ₁₁·a⁸ ₁₅+y⁸ ₁₂·a⁸ ₁₆+y⁸ ₁₃·a⁸ ₁₉+y⁸ ₁₄·a⁸ ₂₀+y⁸ ₁₅·a⁸ ₂₁+y⁸ ₁₆·a⁸ ₂₂.

As shown above, output element o³ ₁ of converted output data matrix 216 is equal to output element o³ ₁ of output data matrix 206 ³.

Output element o³ ₂ is the dot product of the third row of converted weight matrix 212, i.e., converted weight tensor 212 ³, and the second column of converted input data matrix 214.

More particularly, output element o³ ₂ is equal to y¹ ₁·a¹ ₂+y¹ ₂·a¹ ₃+y¹ ₃·a¹ ₄+y¹ ₄·a¹ ₅+y¹ ₅·a¹ ₈+y¹ ₆·a¹ ₉+y¹ ₇·a¹ ₁₀+y¹ ₈·a¹ ₁₁+y¹ ₉·a¹ ₁₄+y¹ ₁₀·a¹ ₁₅+y¹ ₁₁·a¹ ₁₆+y¹ ₁₂·a¹ ₁₇+y¹ ₁₃·a¹ ₂₀+y¹ ₁₄·a¹ ₂₁+y¹ ₁₅·a¹ ₂₂+y¹ ₁₆·a¹ ₂₃+y² ₁·a² ₂+y² ₂·a² ₃+y² ₃·a² ₄+y² ₄·a² ₅+y² ₅·a² ₈+y² ₆·a² ₉+y² ₇·a² ₁₀+y² ₈·a² ₁₁+y² ₉·a² ₁₄+y² ₁₀·a² ₁₅+y² ₁₁·a² ₁₆+y² ₁₂·a² ₁₇+y² ₁₃·a² ₂₀+y² ₁₄·a² ₂₁+y² ₁₅·a² ₂₂+y² ₁₆·a² ₂₃+y³ ₁·a³ ₂+y³ ₂·a³ ₃+y³ ₃·a³ ₄+y³ ₄·a³ ₅+y³ ₅·a³ ₈+y³ ₆·a³ ₉+y³ ₇·a³ ₁₀+y³ ₈·a³ ₁₁+y³ ₉·a³ ₁₄+y³ ₁₀·a³ ₁₅+y³ ₁₁·a³ ₁₆+y³ ₁₂·a³ ₁₇+y³ ₁₃·a³ ₂₀+y³ ₁₄·a³ ₂₁+y³ ₁₅·a³ ₂₂+y³ ₁₆·a³ ₂₃+y⁴ ₁·a⁴ ₂+y⁴ ₂·a⁴ ₃+y⁴ ₃·a⁴ ₄+y⁴ ₄·a⁴ ₅+y⁴ ₅·a⁴ ₈+y⁴ ₆·a⁴ ₉+y⁴ ₇·a⁴ ₁₀+y⁴ ₈·a⁴ ₁₁+y⁴ ₉·a⁴ ₁₄+y⁴ ₁₀·a⁴ ₁₅+y⁴ ₁₁·a⁴ ₁₆+y⁴ ₁₂·a⁴ ₁₇+y⁴ ₁₃·a⁴ ₂₀+y⁴ ₁₄·a⁴ ₂₁+y⁴ ₁₅·a⁴ ₂₂+y⁴ ₁₆·a⁴ ₂₃+y⁵ ₁·a⁵ ₂+y⁵ ₂·a⁵ ₃+y⁵ ₃·a⁵ ₄+y⁵ ₄·a⁵ ₅+y⁵ ₅·a⁵ ₈+y⁵ ₆·a⁵ ₉+y⁵ ₇·a⁵ ₁₀+y⁵ ₈·a⁵ ₁₁+y⁵ ₉·a⁵ ₁₄+y⁵ ₁₀·a⁵ ₁₅+y⁵ ₁₁·a⁵ ₁₆+y⁵ ₁₂·a⁵ ₁₇+y⁵ ₁₃·a⁵ ₂₀+y⁵ ₁₄·a⁵ ₂₁+y⁵ ₁₅·a⁵ ₂₂+y⁵ ₁₆·a⁵ ₂₃+y⁶ ₁·a⁶ ₂+y⁶ ₂·a⁶ ₃+y⁶ ₃·a⁶ ₄+y⁶ ₄·a⁶ ₅+y⁶ ₅·a⁶ ₈+y⁶ ₆·a⁶ ₉+y⁶ ₇·a⁶ ₁₀+y⁶ ₈·a⁶ ₁₁+y⁶ ₉·a⁶ ₁₄+y⁶ ₁₀·a⁶ ₁₅+y⁶ ₁₁·a⁶ ₁₆+y⁶ ₁₂·a⁶ ₁₇+y⁶ ₁₃·a⁶ ₂₀+y⁶ ₁₄·a⁶ ₂₁+y⁶ ₁₅·a⁶ ₂₂+y⁶ ₁₆·a⁶ ₂₃+y⁷ ₁·a⁷ ₂+y⁷ ₂·a⁷ ₃+y⁷ ₃·a⁷ ₄+y⁷ ₄·a⁷ ₅+y⁷ ₅·a⁷ ₈+y⁷ ₆·a⁷ ₉+y⁷ ₇·a⁷ ₁₀+y⁷ ₈·a⁷ ₁₁+y⁷ ₉·a⁷ ₁₄+y⁷ ₁₀·a⁷ ₁₅+y⁷ ₁₁·a⁷ ₁₆+y⁷ ₁₂·a⁷ ₁₇+y⁷ ₁₃·a⁷ ₂₀+y⁷ ₁₄·a⁷ ₂₁+y⁷ ₁₅·a⁷ ₂₂+y⁷ ₁₆·a⁷ ₂₃+y⁸ ₁·a⁸ ₂+y⁸ ₂·a⁸ ₃+y⁸ ₃·a⁸ ₄+y⁸ ₄·a⁸ ₅+y⁸ ₅·a⁸ ₈+y⁸ ₆·a⁸ ₉+y⁸ ₇·a⁸ ₁₀+y⁸ ₈·a⁸ ₁₁+y⁸ ₉·a⁸ ₁₄+y⁸ ₁₀·a⁸ ₁₅+y⁸ ₁₁·a⁸ ₁₆+y⁸ ₁₂·a⁸ ₁₇+y⁸ ₁₃·a⁸ ₂₀+y⁸ ₁₄·a⁸ ₂₁+y⁸ ₁₅·a⁸ ₂₂+y⁸ ₁₆·a⁸ ₂₃.

As shown above, output element o³ ₂ of converted output data matrix 216 is equal to output element o³ ₂ of output data matrix 206 ³.

Output element o³ ₃ is the dot product of the third row of converted weight matrix 212, i.e., converted weight tensor 212 ³, and the third column of converted input data matrix 214.

More particularly, output element o³ ₃ is equal to y¹ ₁·a¹ ₃+y¹ ₂·a¹ ₄+y¹ ₃·a¹ ₅+y¹ ₄·a¹ ₆+y¹ ₅·a¹ ₉+y¹ ₆·a¹ ₁₀+y¹ ₇·a¹ ₁₁+y¹ ₈·a¹ ₁₂+y¹ ₉·a¹ ₁₅+y¹ ₁₀·a¹ ₁₆+y¹ ₁₁·a¹ ₁₇+y¹ ₁₂·a¹ ₁₈+y¹ ₁₃·a¹ ₂₁+y¹ ₁₄·a¹ ₂₂+y¹ ₁₅·a¹ ₂₃+y¹ ₁₆·a¹ ₂₄+y² ₁·a² ₃+y² ₂·a² ₄+y² ₃·a² ₅+y² ₄·a² ₆+y² ₅·a² ₉+y² ₆·a² ₁₀+y² ₇·a² ₁₁+y² ₈·a² ₁₂+y² ₉·a² ₁₅+y² ₁₀·a² ₁₆+y² ₁₁·a² ₁₇+y² ₁₂·a² ₁₈+y² ₁₃·a² ₂₁+y² ₁₄·a² ₂₂+y² ₁₅·a² ₂₃+y² ₁₆·a² ₂₄+y³ ₁·a³ ₃+y³ ₂·a³ ₄+y³ ₃·a³ ₅+y³ ₄·a³ ₆+y³ ₅·a³ ₉+y³ ₆·a³ ₁₀+y³ ₇·a³ ₁₁+y³ ₈·a³ ₁₂+y³ ₉·a³ ₁₅+y³ ₁₀·a³ ₁₆+y³ ₁₁·a³ ₁₇+y³ ₁₂·a³ ₁₈+y³ ₁₃·a³ ₂₁+y³ ₁₄·a³ ₂₂+y³ ₁₅·a³ ₂₃+y³ ₁₆·a³ ₂₄+y⁴ ₁·a⁴ ₃+y⁴ ₂·a⁴ ₄+y⁴ ₃·a⁴ ₅+y⁴ ₄·a⁴ ₆+y⁴ ₅·a⁴ ₉+y⁴ ₆·a⁴ ₁₀+y⁴ ₇·a⁴ ₁₁+y⁴ ₈·a⁴ ₁₂+y⁴ ₉·a⁴ ₁₅+y⁴ ₁₀·a⁴ ₁₆+y⁴ ₁₁·a⁴ ₁₇+y⁴ ₁₂·a⁴ ₁₈+y⁴ ₁₃·a⁴ ₂₁+y⁴ ₁₄·a⁴ ₂₂+y⁴ ₁₅·a⁴ ₂₃+y⁴ ₁₆·a⁴ ₂₄+y⁵ ₁·a⁵ ₃+y⁵ ₂·a⁵ ₄+y⁵ ₃·a⁵ ₅+y⁵ ₄·a⁵ ₆+y⁵ ₅·a⁵ ₉+y⁵ ₆·a⁵ ₁₀+y⁵ ₇·a⁵ ₁₁+y⁵ ₈·a⁵ ₁₂+y⁵ ₉·a⁵ ₁₅+y⁵ ₁₀·a⁵ ₁₆+y⁵ ₁₁·a⁵ ₁₇+y⁵ ₁₂·a⁵ ₁₈+y⁵ ₁₃·a⁵ ₂₁+y⁵ ₁₄·a⁵ ₂₂+y⁵ ₁₅·a⁵ ₂₃+y⁵ ₁₆·a⁵ ₂₄+y⁶ ₁·a⁶ ₃+y⁶ ₂·a⁶ ₄+y⁶ ₃·a⁶ ₅+y⁶ ₄·a⁶ ₆+y⁶ ₅·a⁶ ₉+y⁶ ₆·a⁶ ₁₀+y⁶ ₇·a⁶ ₁₁+y⁶ ₈·a⁶ ₁₂+y⁶ ₉·a⁶ ₁₅+y⁶ ₁₀·a⁶ ₁₆+y⁶ ₁₁·a⁶ ₁₇+y⁶ ₁₂·a⁶ ₁₈+y⁶ ₁₃·a⁶ ₂₁+y⁶ ₁₄·a⁶ ₂₂+y⁶ ₁₅·a⁶ ₂₃+y⁶ ₁₆·a⁶ ₂₄+y⁷ ₁·a⁷ ₃+y⁷ ₂·a⁷ ₄+y⁷ ₃·a⁷ ₅+y⁷ ₄·a⁷ ₆+y⁷ ₅·a⁷ ₉+y⁷ ₆·a⁷ ₁₀+y⁷ ₇·a⁷ ₁₁+y⁷ ₈·a⁷ ₁₂+y⁷ ₉·a⁷ ₁₅+y⁷ ₁₀·a⁷ ₁₆+y⁷ ₁₁·a⁷ ₁₇+y⁷ ₁₂·a⁷ ₁₈+y⁷ ₁₃·a⁷ ₂₁+y⁷ ₁₄·a⁷ ₂₂+y⁷ ₁₅·a⁷ ₂₃+y⁷ ₁₆·a⁷ ₂₄+y⁸ ₁·a⁸ ₃+y⁸ ₂·a⁸ ₄+y⁸ ₃·a⁸ ₅+y⁸ ₄·a⁸ ₆+y⁸ ₅·a⁸ ₉+y⁸ ₆·a⁸ ₁₀+y⁸ ₇·a⁸ ₁₁+y⁸ ₈·a⁸ ₁₂+y⁸ ₉·a⁸ ₁₅+y⁸ ₁₀·a⁸ ₁₆+y⁸ ₁₁·a⁸ ₁₇+y⁸ ₁₂·a⁸ ₁₈+y⁸ ₁₃·a⁸ ₂₁+y⁸ ₁₄·a⁸ ₂₂+y⁸ ₁₅·a⁸ ₂₃+y⁸ ₁₆·a⁸ ₂₄.

As shown above, output element o³ ₃ of converted output data matrix 216 is equal to output element o³ ₃ of output data matrix 206 ³.

The remaining output elements of converted output data matrix 216 are calculated in a similar manner.

For converted output data matrix 216 ¹, output element o¹ ₄ is the dot product of converted weight tensor 212 ¹ and the fourth column of converted input data matrix 214, output element o¹ ₅ is the dot product of converted weight tensor 212 ¹ and the fifth column of converted input data matrix 214, output element o¹ ₆ is the dot product of converted weight tensor 212 ¹ and the sixth column of converted input data matrix 214, output element o¹ ₇ is the dot product of converted weight tensor 212 ¹ and the seventh column of converted input data matrix 214, output element o¹ ₈ is the dot product of converted weight tensor 212 ¹ and the eighth column of converted input data matrix 214, and output element o¹ ₉ is the dot product of converted weight tensor 212 ¹ and the ninth column of converted input data matrix 214.

For converted output data matrix 216 ², output element o² ₄ is the dot product of converted weight tensor 212 ² and the fourth column of converted input data matrix 214, output element o² ₅ is the dot product of converted weight tensor 212 ² and the fifth column of converted input data matrix 214, output element o² ₆ is the dot product of converted weight tensor 212 ² and the sixth column of converted input data matrix 214, output element o² ₇ is the dot product of converted weight tensor 212 ² and the seventh column of converted input data matrix 214, output element o² ₈ is the dot product of converted weight tensor 212 ² and the eighth column of converted input data matrix 214, and output element o² ₉ is the dot product of converted weight tensor 212 ² and the ninth column of converted input data matrix 214.

For converted output data matrix 216 ³, output element o³ ₄ is the dot product of converted weight tensor 212 ³ and the fourth column of converted input data matrix 214, output element o³ ₅ is the dot product of converted weight tensor 212 ³ and the fifth column of converted input data matrix 214, output element o³ ₆ is the dot product of converted weight tensor 212 ³ and the sixth column of converted input data matrix 214, output element o³ ₇ is the dot product of converted weight tensor 212 ³ and the seventh column of converted input data matrix 214, output element o³ ₈ is the dot product of converted weight tensor 212 ³ and the eighth column of converted input data matrix 214, and output element o³ ₉ is the dot product of converted weight tensor 212 ³ and the ninth column of converted input data matrix 214.
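By way of illustration only, the row-by-column dot products described above may be sketched in Python. The shapes used here (8 input channels, 6×6 input feature maps, 4×4 filters and 3 filters) are assumptions inferred from the index pattern of the expansions above, and the names `weights`, `inputs`, `converted_weights` and `converted_columns` are hypothetical. The sketch builds one 128-element row per filter and one 128-element column per output position, and compares the resulting dot products against a direct sliding-window convolution.

```python
import random

# Assumed shapes, inferred from the index pattern in the text.
C, H, W, K, F = 8, 6, 6, 4, 3      # channels, input height/width, filter size, filters
OUT = H - K + 1                    # 3 output rows and columns, 9 elements per filter

random.seed(0)
# weights[f][c][r][s] stands in for the weight tensors; inputs[c][y][x] for the input data.
weights = [[[[random.randint(-4, 4) for _ in range(K)] for _ in range(K)]
            for _ in range(C)] for _ in range(F)]
inputs = [[[random.randint(-4, 4) for _ in range(W)] for _ in range(H)]
          for _ in range(C)]

# Converted weight matrix: one row of C*K*K = 128 weights per filter.
converted_weights = [[weights[f][c][r][s]
                      for c in range(C) for r in range(K) for s in range(K)]
                     for f in range(F)]

# Converted input data matrix, stored column by column: one column of 128
# activations per output position.
converted_columns = [[inputs[c][i + r][j + s]
                      for c in range(C) for r in range(K) for s in range(K)]
                     for i in range(OUT) for j in range(OUT)]

# Each output element is the dot product of one row and one column.
converted_outputs = [[sum(w * a for w, a in zip(row, col))
                      for col in converted_columns]
                     for row in converted_weights]

# Reference: direct sliding-window convolution (stride 1, no padding).
direct_outputs = [[sum(weights[f][c][r][s] * inputs[c][i + r][j + s]
                       for c in range(C) for r in range(K) for s in range(K))
                   for i in range(OUT) for j in range(OUT)]
                  for f in range(F)]

# 1-based activation indices used by the first column, per channel.
first_column_indices = [W * r + s + 1 for r in range(K) for s in range(K)]
```

Under these assumptions, `first_column_indices` reproduces the per-channel subscripts 1, 2, 3, 4, 7, 8, 9, 10, 13, 14, 15, 16, 19, 20, 21 and 22 that appear in the expansion of the first output element, and `converted_outputs` matches `direct_outputs` element for element.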

FIG. 3 depicts data flow diagram 220 for MAC array 228, in accordance with an embodiment of the present disclosure.

As noted above, GEMM operations may be implemented in a dedicated ANN hardware accelerator using an array of MAC units. In this embodiment, MAC array 228 is a systolic, output stationary array that implements converted convolution operation 210 using a 3×3 array of MAC units m₁, . . . , m₉. The orientation of transposed converted weight matrix 222, transposed converted input data matrix 224, and transposed converted output data matrix 226 relative to MAC array 228 simplifies illustration; other orientations are also contemplated. As discussed above, each MAC unit calculates a dot product, between a row of converted weight matrix 212 and a column of converted input data matrix 214, to generate an element of converted output data matrix 216.

Generally, a MAC unit includes, inter alia, a multiplier, an adder and a storage register. Each MAC unit is reset by clearing or zeroing its storage register prior to, or at the start of, a new dot product calculation. Generally, the rows from converted weight matrix 212 are read from local memory, a register, etc., enter MAC array 228 at the first row of MAC units m₁, m₂ and m₃, and propagate one MAC unit down at the beginning of each processing cycle. Similarly, the columns from converted input data matrix 214 are read from local memory, a register, etc., enter MAC array 228 at the first column of MAC units m₁, m₄ and m₇, and propagate one MAC unit to the right at the beginning of each processing cycle. The dot product calculations performed by MAC unit m₁ for the blocks of the first quadrants a¹ _(q1), a² _(q1) and a³ _(q1) of converted input data matrix 214 are discussed in detail below, while the dot product calculations performed by the remaining MAC units of MAC array 228 are summarized below.

MAC unit m₁ calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight tensor 212 ¹) and the first column of converted input data matrix 214 to generate element o¹ ₁ of converted output data matrix 216. During processing cycle 1, MAC unit m₁ receives a¹ ₁ and w¹ ₁ from local memory, multiplies a¹ ₁ and w¹ ₁ to generate an intermediate product, adds the intermediate product to the value stored in the storage register (i.e., 0), and stores the accumulated result back in the storage register. During processing cycle 2, MAC unit m₁ transmits a¹ ₁ to MAC unit m₂ and w¹ ₁ to MAC unit m₄, receives a¹ ₂ and w¹ ₂ from local memory, multiplies a¹ ₂ and w¹ ₂ to generate an intermediate product, adds the intermediate product to the value stored in the storage register, and stores the accumulated result back in the storage register. During processing cycle 3, MAC unit m₁ transmits a¹ ₂ to MAC unit m₂ and w¹ ₂ to MAC unit m₄, receives a¹ ₃ and w¹ ₃ from local memory, multiplies a¹ ₃ and w¹ ₃ to generate an intermediate product, adds the intermediate product to the value stored in the storage register, and stores the accumulated result back in the storage register. During processing cycle 4, MAC unit m₁ transmits a¹ ₃ to MAC unit m₂ and w¹ ₃ to MAC unit m₄, receives a¹ ₄ and w¹ ₄ from local memory, multiplies a¹ ₄ and w¹ ₄ to generate an intermediate product, adds the intermediate product to the value stored in the storage register, and stores the accumulated result back in the storage register.

Processing cycles 5 through 128 multiply and accumulate the remaining 124 elements of the first row of converted weight matrix 212 and the first column of converted input data matrix 214. At the end of processing cycle 128, MAC unit m₁ outputs element o¹ ₁. The remainder of the first row of MAC array 228 includes MAC units m₂ and m₃.
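By way of illustration only, the multiply-accumulate behavior of a single MAC unit may be sketched in Python as follows. The class name `MacUnit` and the helper `dot_product` are hypothetical, and the sketch models only the multiplier, adder and storage register described above, not the forwarding of operands to neighboring MAC units.

```python
class MacUnit:
    """Hypothetical model of one MAC unit: multiplier, adder, storage register."""

    def __init__(self):
        self.register = 0

    def reset(self):
        # Clear the storage register before a new dot product calculation.
        self.register = 0

    def step(self, weight, activation):
        # One processing cycle: multiply, add to the stored value, store back.
        self.register += weight * activation
        return self.register


def dot_product(mac, weight_row, input_column):
    # One processing cycle per pair of operands (128 cycles in the example above).
    mac.reset()
    for weight, activation in zip(weight_row, input_column):
        mac.step(weight, activation)
    return mac.register


# Illustrative operands only: weights 1..128 against an all-ones column.
example_result = dot_product(MacUnit(), list(range(1, 129)), [1] * 128)
```

With these illustrative operands, `example_result` is the sum of the integers 1 through 128.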

After an initial delay of one processing cycle, MAC unit m₂ receives weights from the first delay register ff₁ and input data from MAC unit m₁, transmits weights to MAC unit m₅ and input data to MAC unit m₃, and calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight tensor 212 ²) and the first column of converted input data matrix 214 to generate element o² ₁ of converted output data matrix 216. The initial delay of one processing cycle allows the delay pipeline (i.e., delay register ff₁) to be filled with weights transferred from memory, and the input data to become available from MAC unit m₁. At the end of processing cycle 129, MAC unit m₂ outputs element o² ₁.

After an initial delay of two processing cycles, MAC unit m₃ receives weights from the second delay register ff₂ and input data from MAC unit m₂, transmits weights to MAC unit m₆, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight tensor 212 ³) and the first column of converted input data matrix 214 to generate element o³ ₁ of converted output data matrix 216. The initial delay of two processing cycles allows the delay pipeline (i.e., delay registers ff₁ and ff₂) to be filled with weights transferred from memory, and the input data to become available from MAC unit m₂. At the end of processing cycle 130, MAC unit m₃ outputs element o³ ₁.

The second row of MAC array 228 includes MAC units m₄, m₅ and m₆. After an initial delay of one processing cycle, MAC unit m₄ receives weights from MAC unit m₁ and input data from a first delay register ff₁, transmits weights to MAC unit m₇ and input data to MAC unit m₅, and calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight tensor 212 ¹) and the second column of converted input data matrix 214 to generate element o¹ ₂ of converted output data matrix 216. The initial delay of one processing cycle allows the delay pipeline (i.e., delay register ff₁) to be filled with input data transferred from memory, and the weights to become available from MAC unit m₁. At the end of processing cycle 129, MAC unit m₄ outputs element o¹ ₂.

After an initial delay of two processing cycles, MAC unit m₅ receives weights from MAC unit m₂ and input data from MAC unit m₄, transmits weights to MAC unit m₈ and input data to MAC unit m₆, and calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight tensor 212 ²) and the second column of converted input data matrix 214 to generate element o² ₂ of converted output data matrix 216. The initial delay of two processing cycles allows the weights to become available from MAC unit m₂, and the input data to become available from MAC unit m₄. At the end of processing cycle 130, MAC unit m₅ outputs element o² ₂.

After an initial delay of three processing cycles, MAC unit m₆ receives weights from MAC unit m₃ and input data from MAC unit m₅, transmits weights to MAC unit m₉, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight tensor 212 ³) and the second column of converted input data matrix 214 to generate element o³ ₂ of converted output data matrix 216. The initial delay of three processing cycles allows the weights to become available from MAC unit m₃, and the input data to become available from MAC unit m₅. At the end of processing cycle 131, MAC unit m₆ outputs element o³ ₂.

The third row of MAC array 228 includes MAC units m₇, m₈ and m₉.

After an initial delay of two processing cycles, MAC unit m₇ receives weights from MAC unit m₄ and input data from a second delay register ff₂, transmits input data to MAC unit m₈, and calculates the dot product of the first row of converted weight matrix 212 (i.e., converted weight tensor 212 ¹) and the third column of converted input data matrix 214 to generate element o¹ ₃ of converted output data matrix 216. The initial delay of two processing cycles allows the delay pipeline (i.e., delay registers ff₁ and ff₂) to be filled with input data transferred from memory, and the weights to become available from MAC unit m₄. At the end of processing cycle 130, MAC unit m₇ outputs element o¹ ₃.

After an initial delay of three processing cycles, MAC unit m₈ receives weights from MAC unit m₅ and input data from MAC unit m₇, transmits input data to MAC unit m₉, and calculates the dot product of the second row of converted weight matrix 212 (i.e., converted weight tensor 212 ²) and the third column of converted input data matrix 214 to generate element o² ₃ of converted output data matrix 216. The initial delay of three processing cycles allows the weights to become available from MAC unit m₅, and the input data to become available from MAC unit m₇. At the end of processing cycle 131, MAC unit m₈ outputs element o² ₃.

After an initial delay of four processing cycles, MAC unit m₉ receives weights from MAC unit m₆ and input data from MAC unit m₈, and calculates the dot product of the third row of converted weight matrix 212 (i.e., converted weight tensor 212 ³) and the third column of converted input data matrix 214 to generate element o³ ₃ of converted output data matrix 216. The initial delay of four processing cycles allows the weights to become available from MAC unit m₆, and the input data to become available from MAC unit m₈. At the end of processing cycle 132, MAC unit m₉ outputs element o³ ₃.
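By way of illustration only, the staggered timing of the 3×3 output stationary array may be sketched as a cycle-level simulation in Python. The function name `simulate_output_stationary` and the operand values are hypothetical, and the delay registers are modeled simply as a delayed feed at the array edges, which is an assumption of this sketch rather than a description of the hardware. The unit in array row p and array column q accumulates the dot product of weight row q and input column p, so that position (0, 0) corresponds to MAC unit m₁ and position (2, 2) corresponds to MAC unit m₉.

```python
def simulate_output_stationary(weight_rows, input_cols):
    """Cycle-level sketch of an output stationary systolic array.

    Weights enter at the top of array column q after q delay cycles and move
    down one unit per cycle; input data enter at the left of array row p after
    p delay cycles and move right one unit per cycle.
    """
    n = len(weight_rows[0])
    rows, cols = len(input_cols), len(weight_rows)
    w_reg = [[None] * cols for _ in range(rows)]   # weight held by each unit
    a_reg = [[None] * cols for _ in range(rows)]   # input value held by each unit
    acc = [[0] * cols for _ in range(rows)]        # storage registers
    seen = [[0] * cols for _ in range(rows)]       # products accumulated so far
    done = [[None] * cols for _ in range(rows)]    # cycle at which each unit finishes
    cycle = 0
    while any(d is None for row in done for d in row):
        cycle += 1
        new_w = [[None] * cols for _ in range(rows)]
        new_a = [[None] * cols for _ in range(rows)]
        for p in range(rows):
            for q in range(cols):
                if p == 0:
                    k = cycle - 1 - q              # edge feed delayed by q cycles
                    new_w[p][q] = weight_rows[q][k] if 0 <= k < n else None
                else:
                    new_w[p][q] = w_reg[p - 1][q]  # weight passed down
                if q == 0:
                    k = cycle - 1 - p              # edge feed delayed by p cycles
                    new_a[p][q] = input_cols[p][k] if 0 <= k < n else None
                else:
                    new_a[p][q] = a_reg[p][q - 1]  # input data passed right
        w_reg, a_reg = new_w, new_a
        for p in range(rows):
            for q in range(cols):
                if w_reg[p][q] is not None and a_reg[p][q] is not None:
                    acc[p][q] += w_reg[p][q] * a_reg[p][q]
                    seen[p][q] += 1
                    if seen[p][q] == n:
                        done[p][q] = cycle
    return acc, done


# Illustrative operands only: three weight rows and three input columns of length 128.
weight_rows = [[(f + 1) * (k % 5 - 2) for k in range(128)] for f in range(3)]
input_cols = [[(k + p) % 7 - 3 for k in range(128)] for p in range(3)]
outputs, finish_cycle = simulate_output_stationary(weight_rows, input_cols)
```

In this sketch, `finish_cycle` holds 128 for the position of MAC unit m₁ and 132 for the position of MAC unit m₉, with each step down or to the right adding one processing cycle, which agrees with the completion cycles recited above.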

After the blocks of the first quadrants a¹ _(q1), a² _(q1), and a³ _(q1) of converted input data matrix 214 have been processed, the next sequence of operations processes the blocks of the second quadrants a¹ _(q2), a² _(q2), and a³ _(q2). After the blocks of the second quadrants a¹ _(q2), a² _(q2), and a³ _(q2) have been processed, the next sequence of operations processes the blocks of the third quadrants a¹ _(q3), a² _(q3), and a³ _(q3). Converted weight matrix 212 is accessed for each sequence of operations.

Unfortunately, for CNNs executing on CPUs, GPUs, NPUs, etc., GEMM operations consume a significant number of processor cycles due to the large number of multiplications that are required. For example, one known image recognition CNN requires 3 giga operations per second (GOPS) per input data frame. Compounding this problem, many of the ANN matrices upon which GEMM operations are performed are sparse, which produces a very inefficient use of storage resources. More particularly, CNN weight tensors are stored in a dense or uncompressed form, even though the weights typically contain a significant amount of zero values.

Embodiments of the present disclosure advantageously provide a matrix encoding process that reduces storage requirements and provides flexibility in both quantization and pruning within a fixed block size format. More particularly, embodiments of the present disclosure advantageously provide a block-based encoding process for ANN matrices, such as weight tensors, that is of fixed storage size, but allows trade-off in the matrix elements between zero values, smaller magnitude values and larger magnitude values. Many embodiments of the present disclosure also provide a fixed computation size per block, which is advantageous in many situations.

FIGS. 4A, 4B and 4C depict weight tensors 202 ¹, 202 ² and 202 ³, respectively, in accordance with an embodiment of the present disclosure.

As discussed above, each weight tensor 202 ^(i) includes one 4×4 weight matrix for each input channel, i.e., weight matrices 202 ^(i) ₁, 202 ^(i) ₂, 202 ^(i) ₃, 202 ^(i) ₄, 202 ^(i) ₅, 202 ^(i) ₆, 202 ^(i) ₇ and 202 ^(i) ₈, so weight tensor 202 ¹ includes weight matrices 202 ¹ ₁, 202 ¹ ₂, 202 ¹ ₃, 202 ¹ ₄, 202 ¹ ₅, 202 ¹ ₆, 202 ¹ ₇ and 202 ¹ ₈, weight tensor 202 ² includes weight matrices 202 ² ₁, 202 ² ₂, 202 ² ₃, 202 ² ₄, 202 ² ₅, 202 ² ₆, 202 ² ₇ and 202 ² ₈, and weight tensor 202 ³ includes weight matrices 202 ³ ₁, 202 ³ ₂, 202 ³ ₃, 202 ³ ₄, 202 ³ ₅, 202 ³ ₆, 202 ³ ₇ and 202 ³ ₈.

FIGS. 5A, 5B and 5C depict basic block sets 302 ¹, 302 ² and 302 ³, respectively, in accordance with an embodiment of the present disclosure.

In one embodiment, each weight tensor 202 ^(i) may be decomposed into a basic block set 302 ^(i) that includes 16 basic blocks 302 ^(i) _(j) along the depth or channel dimension. Each basic block is a 1×1×b tensor, and includes b weight values. The depth, b, is a hyperparameter having a value of 4, 8, 16, etc. In this embodiment, b equals 8, i.e., the number of channels.

More particularly, basic block set 302 ¹ includes basic blocks 302 ¹ ₁, 302 ¹ ₂, 302 ¹ ₃, 302 ¹ ₄, 302 ¹ ₅, 302 ¹ ₆, 302 ¹ ₇, 302 ¹ ₈, 302 ¹ ₉, 302 ¹ ₁₀, 302 ¹ ₁₁, 302 ¹ ₁₂, 302 ¹ ₁₃, 302 ¹ ₁₄, 302 ¹ ₁₅ and 302 ¹ ₁₆, basic block set 302 ² includes basic blocks 302 ² ₁, 302 ² ₂, 302 ² ₃, 302 ² ₄, 302 ² ₅, 302 ² ₆, 302 ² ₇, 302 ² ₈, 302 ² ₉, 302 ² ₁₀, 302 ² ₁₁, 302 ² ₁₂, 302 ² ₁₃, 302 ² ₁₄, 302 ² ₁₅ and 302 ² ₁₆, and basic block set 302 ³ includes basic blocks 302 ³ ₁, 302 ³ ₂, 302 ³ ₃, 302 ³ ₄, 302 ³ ₅, 302 ³ ₆, 302 ³ ₇, 302 ³ ₈, 302 ³ ₉, 302 ³ ₁₀, 302 ³ ₁₁, 302 ³ ₁₂, 302 ³ ₁₃, 302 ³ ₁₄, 302 ³ ₁₅ and 302 ³ ₁₆.

With respect to basic block set 302 ¹, basic block 302 ¹ ₁ includes weights w¹ ₁, w² ₁, w³ ₁, w⁴ ₁, w⁵ ₁, w⁶ ₁, w⁷ ₁ and w⁸ ₁, basic block 302 ¹ ₂ includes weights w¹ ₂, w² ₂, w³ ₂, w⁴ ₂, w⁵ ₂, w⁶ ₂, w⁷ ₂ and w⁸ ₂, basic block 302 ¹ ₃ includes weights w¹ ₃, w² ₃, w³ ₃, w⁴ ₃, w⁵ ₃, w⁶ ₃, w⁷ ₃ and w⁸ ₃, basic block 302 ¹ ₄ includes weights w¹ ₄, w² ₄, w³ ₄, w⁴ ₄, w⁵ ₄, w⁶ ₄, w⁷ ₄ and w⁸ ₄, basic block 302 ¹ ₅ includes weights w¹ ₅, w² ₅, w³ ₅, w⁴ ₅, w⁵ ₅, w⁶ ₅, w⁷ ₅ and w⁸ ₅, basic block 302 ¹ ₆ includes weights w¹ ₆, w² ₆, w³ ₆, w⁴ ₆, w⁵ ₆, w⁶ ₆, w⁷ ₆ and w⁸ ₆, basic block 302 ¹ ₇ includes weights w¹ ₇, w² ₇, w³ ₇, w⁴ ₇, w⁵ ₇, w⁶ ₇, w⁷ ₇ and w⁸ ₇, basic block 302 ¹ ₈ includes weights w¹ ₈, w² ₈, w³ ₈, w⁴ ₈, w⁵ ₈, w⁶ ₈, w⁷ ₈ and w⁸ ₈, basic block 302 ¹ ₉ includes weights w¹ ₉, w² ₉, w³ ₉, w⁴ ₉, w⁵ ₉, w⁶ ₉, w⁷ ₉ and w⁸ ₉, basic block 302 ¹ ₁₀ includes weights w¹ ₁₀, w² ₁₀, w³ ₁₀, w⁴ ₁₀, w⁵ ₁₀, w⁶ ₁₀, w⁷ ₁₀ and w⁸ ₁₀, basic block 302 ¹ ₁₁ includes weights w¹ ₁₁, w² ₁₁, w³ ₁₁, w⁴ ₁₁, w⁵ ₁₁, w⁶ ₁₁, w⁷ ₁₁ and w⁸ ₁₁, basic block 302 ¹ ₁₂ includes weights w¹ ₁₂, w² ₁₂, w³ ₁₂, w⁴ ₁₂, w⁵ ₁₂, w⁶ ₁₂, w⁷ ₁₂ and w⁸ ₁₂, basic block 302 ¹ ₁₃ includes weights w¹ ₁₃, w² ₁₃, w³ ₁₃, w⁴ ₁₃, w⁵ ₁₃, w⁶ ₁₃, w⁷ ₁₃ and w⁸ ₁₃, basic block 302 ¹ ₁₄ includes weights w¹ ₁₄, w² ₁₄, w³ ₁₄, w⁴ ₁₄, w⁵ ₁₄, w⁶ ₁₄, w⁷ ₁₄ and w⁸ ₁₄, basic block 302 ¹ ₁₅ includes weights w¹ ₁₅, w² ₁₅, w³ ₁₅, w⁴ ₁₅, w⁵ ₁₅, w⁶ ₁₅, w⁷ ₁₅ and w⁸ ₁₅, and basic block 302 ¹ ₁₆ includes weights w¹ ₁₆, w² ₁₆, w³ ₁₆, w⁴ ₁₆, w⁵ ₁₆, w⁶ ₁₆, w⁷ ₁₆ and w⁸ ₁₆. Basic blocks 302 ¹ ₅, 302 ¹ ₆, 302 ¹ ₇, 302 ¹ ₈, 302 ¹ ₉, 302 ¹ ₁₀, 302 ¹ ₁₁, 302 ¹ ₁₂ are not depicted in FIG. 5A for clarity.

With respect to basic block set 302 ², basic block 302 ² ₁ includes weights x¹ ₁, x² ₁, x³ ₁, x⁴ ₁, x⁵ ₁, x⁶ ₁, x⁷ ₁ and x⁸ ₁, basic block 302 ² ₂ includes weights x¹ ₂, x² ₂, x³ ₂, x⁴ ₂, x⁵ ₂, x⁶ ₂, x⁷ ₂ and x⁸ ₂, basic block 302 ² ₃ includes weights x¹ ₃, x² ₃, x³ ₃, x⁴ ₃, x⁵ ₃, x⁶ ₃, x⁷ ₃ and x⁸ ₃, basic block 302 ² ₄ includes weights x¹ ₄, x² ₄, x³ ₄, x⁴ ₄, x⁵ ₄, x⁶ ₄, x⁷ ₄ and x⁸ ₄, basic block 302 ² ₅ includes weights x¹ ₅, x² ₅, x³ ₅, x⁴ ₅, x⁵ ₅, x⁶ ₅, x⁷ ₅ and x⁸ ₅, basic block 302 ² ₆ includes weights x¹ ₆, x² ₆, x³ ₆, x⁴ ₆, x⁵ ₆, x⁶ ₆, x⁷ ₆ and x⁸ ₆, basic block 302 ² ₇ includes weights x¹ ₇, x² ₇, x³ ₇, x⁴ ₇, x⁵ ₇, x⁶ ₇, x⁷ ₇ and x⁸ ₇, basic block 302 ² ₈ includes weights x¹ ₈, x² ₈, x³ ₈, x⁴ ₈, x⁵ ₈, x⁶ ₈, x⁷ ₈ and x⁸ ₈, basic block 302 ² ₉ includes weights x¹ ₉, x² ₉, x³ ₉, x⁴ ₉, x⁵ ₉, x⁶ ₉, x⁷ ₉ and x⁸ ₉, basic block 302 ² ₁₀ includes weights x¹ ₁₀, x² ₁₀, x³ ₁₀, x⁴ ₁₀, x⁵ ₁₀, x⁶ ₁₀, x⁷ ₁₀ and x⁸ ₁₀, basic block 302 ² ₁₁ includes weights x¹ ₁₁, x² ₁₁, x³ ₁₁, x⁴ ₁₁, x⁵ ₁₁, x⁶ ₁₁, x⁷ ₁₁ and x⁸ ₁₁, basic block 302 ² ₁₂ includes weights x¹ ₁₂, x² ₁₂, x³ ₁₂, x⁴ ₁₂, x⁵ ₁₂, x⁶ ₁₂, x⁷ ₁₂ and x⁸ ₁₂, basic block 302 ² ₁₃ includes weights x¹ ₁₃, x² ₁₃, x³ ₁₃, x⁴ ₁₃, x⁵ ₁₃, x⁶ ₁₃, x⁷ ₁₃ and x⁸ ₁₃, basic block 302 ² ₁₄ includes weights x¹ ₁₄, x² ₁₄, x³ ₁₄, x⁴ ₁₄, x⁵ ₁₄, x⁶ ₁₄, x⁷ ₁₄ and x⁸ ₁₄, basic block 302 ² ₁₅ includes weights x¹ ₁₅, x² ₁₅, x³ ₁₅, x⁴ ₁₅, x⁵ ₁₅, x⁶ ₁₅, x⁷ ₁₅ and x⁸ ₁₅, and basic block 302 ² ₁₆ includes weights x¹ ₁₆, x² ₁₆, x³ ₁₆, x⁴ ₁₆, x⁵ ₁₆, x⁶ ₁₆, x⁷ ₁₆ and x⁸ ₁₆. Basic blocks 302 ² ₅, 302 ² ₆, 302 ² ₇, 302 ² ₈, 302 ² ₉, 302 ² ₁₀, 302 ² ₁₁, 302 ² ₁₂ are not depicted in FIG. 5B for clarity.

With respect to basic block set 302 ³, basic block 302 ³ ₁ includes weights y¹ ₁, y² ₁, y³ ₁, y⁴ ₁, y⁵ ₁, y⁶ ₁, y⁷ ₁ and y⁸ ₁, basic block 302 ³ ₂ includes weights y¹ ₂, y² ₂, y³ ₂, y⁴ ₂, y⁵ ₂, y⁶ ₂, y⁷ ₂ and y⁸ ₂, basic block 302 ³ ₃ includes weights y¹ ₃, y² ₃, y³ ₃, y⁴ ₃, y⁵ ₃, y⁶ ₃, y⁷ ₃ and y⁸ ₃, basic block 302 ³ ₄ includes weights y¹ ₄, y² ₄, y³ ₄, y⁴ ₄, y⁵ ₄, y⁶ ₄, y⁷ ₄ and y⁸ ₄, basic block 302 ³ ₅ includes weights y¹ ₅, y² ₅, y³ ₅, y⁴ ₅, y⁵ ₅, y⁶ ₅, y⁷ ₅ and y⁸ ₅, basic block 302 ³ ₆ includes weights y¹ ₆, y² ₆, y³ ₆, y⁴ ₆, y⁵ ₆, y⁶ ₆, y⁷ ₆ and y⁸ ₆, basic block 302 ³ ₇ includes weights y¹ ₇, y² ₇, y³ ₇, y⁴ ₇, y⁵ ₇, y⁶ ₇, y⁷ ₇ and y⁸ ₇, basic block 302 ³ ₈ includes weights y¹ ₈, y² ₈, y³ ₈, y⁴ ₈, y⁵ ₈, y⁶ ₈, y⁷ ₈ and y⁸ ₈, basic block 302 ³ ₉ includes weights y¹ ₉, y² ₉, y³ ₉, y⁴ ₉, y⁵ ₉, y⁶ ₉, y⁷ ₉ and y⁸ ₉, basic block 302 ³ ₁₀ includes weights y¹ ₁₀, y² ₁₀, y³ ₁₀, y⁴ ₁₀, y⁵ ₁₀, y⁶ ₁₀, y⁷ ₁₀ and y⁸ ₁₀, basic block 302 ³ ₁₁ includes weights y¹ ₁₁, y² ₁₁, y³ ₁₁, y⁴ ₁₁, y⁵ ₁₁, y⁶ ₁₁, y⁷ ₁₁ and y⁸ ₁₁, basic block 302 ³ ₁₂ includes weights y¹ ₁₂, y² ₁₂, y³ ₁₂, y⁴ ₁₂, y⁵ ₁₂, y⁶ ₁₂, y⁷ ₁₂ and y⁸ ₁₂, basic block 302 ³ ₁₃ includes weights y¹ ₁₃, y² ₁₃, y³ ₁₃, y⁴ ₁₃, y⁵ ₁₃, y⁶ ₁₃, y⁷ ₁₃ and y⁸ ₁₃, basic block 302 ³ ₁₄ includes weights y¹ ₁₄, y² ₁₄, y³ ₁₄, y⁴ ₁₄, y⁵ ₁₄, y⁶ ₁₄, y⁷ ₁₄ and y⁸ ₁₄, basic block 302 ³ ₁₅ includes weights y¹ ₁₅, y² ₁₅, y³ ₁₅, y⁴ ₁₅, y⁵ ₁₅, y⁶ ₁₅, y⁷ ₁₅ and y⁸ ₁₅, and basic block 302 ³ ₁₆ includes weights y¹ ₁₆, y² ₁₆, y³ ₁₆, y⁴ ₁₆, y⁵ ₁₆, y⁶ ₁₆, y⁷ ₁₆ and y⁸ ₁₆. Basic blocks 302 ³ ₅, 302 ³ ₆, 302 ³ ₇, 302 ³ ₈, 302 ³ ₉, 302 ³ ₁₀, 302 ³ ₁₁ and 302 ³ ₁₂ are not depicted in FIG. 5C for clarity.

FIGS. 6A, 6B and 6C depict basic block matrix sets 312 ¹, 312 ² and 312 ³ respectively, in accordance with an embodiment of the present disclosure.

Basic block set 302 ^(i) may be reformed into a basic block matrix set 312 ^(i) that includes 16 respective basic block matrices 312 ^(i) _(j), i.e., basic block matrices 312 ^(i) ₁, 312 ^(i) ₂, 312 ^(i) ₃, 312 ^(i) ₄, 312 ^(i) ₅, 312 ^(i) ₆, 312 ^(i) ₇, 312 ^(i) ₈, 312 ^(i) ₉, 312 ^(i) ₁₀, 312 ^(i) ₁₁, 312 ^(i) ₁₂, 312 ^(i) ₁₃, 312 ^(i) ₁₄, 312 ^(i) ₁₅ and 312 ^(i) ₁₆. Each basic block matrix 312 ^(i) _(j) has 8 rows and a single column (8×1), and the same weights as the respective basic block 302 ^(i) _(j).

More particularly, basic block matrix set 312 ¹ includes basic block matrices 312 ¹ ₁, 312 ¹ ₂, 312 ¹ ₃, 312 ¹ ₄, 312 ¹ ₅, 312 ¹ ₆, 312 ¹ ₇, 312 ¹ ₈, 312 ¹ ₉, 312 ¹ ₁₀, 312 ¹ ₁₁, 312 ¹ ₁₂, 312 ¹ ₁₃, 312 ¹ ₁₄, 312 ¹ ₁₅ and 312 ¹ ₁₆, basic block matrix set 312 ² includes basic block matrices 312 ² ₁, 312 ² ₂, 312 ² ₃, 312 ² ₄, 312 ² ₅, 312 ² ₆, 312 ² ₇, 312 ² ₈, 312 ² ₉, 312 ² ₁₀, 312 ² ₁₁, 312 ² ₁₂, 312 ² ₁₃, 312 ² ₁₄, 312 ² ₁₅ and 312 ² ₁₆, and basic block matrix set 312 ³ includes basic block matrices 312 ³ ₁, 312 ³ ₂, 312 ³ ₃, 312 ³ ₄, 312 ³ ₅, 312 ³ ₆, 312 ³ ₇, 312 ³ ₈, 312 ³ ₉, 312 ³ ₁₀, 312 ³ ₁₁, 312 ³ ₁₂, 312 ³ ₁₃, 312 ³ ₁₄, 312 ³ ₁₅ and 312 ³ ₁₆.

With respect to basic block matrix set 312 ¹, basic block matrix 312 ¹ ₁ includes weights w¹ ₁, w² ₁, w³ ₁, w⁴ ₁, w⁵ ₁, w⁶ ₁, w⁷ ₁ and w⁸ ₁, basic block matrix 312 ¹ ₂ includes weights w¹ ₂, w² ₂, w³ ₂, w⁴ ₂, w⁵ ₂, w⁶ ₂, w⁷ ₂ and w⁸ ₂, basic block matrix 312 ¹ ₃ includes weights w¹ ₃, w² ₃, w³ ₃, w⁴ ₃, w⁵ ₃, w⁶ ₃, w⁷ ₃ and w⁸ ₃, basic block matrix 312 ¹ ₄ includes weights w¹ ₄, w² ₄, w³ ₄, w⁴ ₄, w⁵ ₄, w⁶ ₄, w⁷ ₄ and w⁸ ₄, basic block matrix 312 ¹ ₅ includes weights w¹ ₅, w² ₅, w³ ₅, w⁴ ₅, w⁵ ₅, w⁶ ₅, w⁷ ₅ and w⁸ ₅, basic block matrix 312 ¹ ₆ includes weights w¹ ₆, w² ₆, w³ ₆, w⁴ ₆, w⁵ ₆, w⁶ ₆, w⁷ ₆ and w⁸ ₆, basic block matrix 312 ¹ ₇ includes weights w¹ ₇, w² ₇, w³ ₇, w⁴ ₇, w⁵ ₇, w⁶ ₇, w⁷ ₇ and w⁸ ₇, basic block matrix 312 ¹ ₈ includes weights w¹ ₈, w² ₈, w³ ₈, w⁴ ₈, w⁵ ₈, w⁶ ₈, w⁷ ₈ and w⁸ ₈, basic block matrix 312 ¹ ₉ includes weights w¹ ₉, w² ₉, w³ ₉, w⁴ ₉, w⁵ ₉, w⁶ ₉, w⁷ ₉ and w⁸ ₉, basic block matrix 312 ¹ ₁₀ includes weights w¹ ₁₀, w² ₁₀, w³ ₁₀, w⁴ ₁₀, w⁵ ₁₀, w⁶ ₁₀, w⁷ ₁₀ and w⁸ ₁₀, basic block matrix 312 ¹ ₁₁ includes weights w¹ ₁₁, w² ₁₁, w³ ₁₁, w⁴ ₁₁, w⁵ ₁₁, w⁶ ₁₁, w⁷ ₁₁ and w⁸ ₁₁, basic block matrix 312 ¹ ₁₂ includes weights w¹ ₁₂, w² ₁₂, w³ ₁₂, w⁴ ₁₂, w⁵ ₁₂, w⁶ ₁₂, w⁷ ₁₂ and w⁸ ₁₂, basic block matrix 312 ¹ ₁₃ includes weights w¹ ₁₃, w² ₁₃, w³ ₁₃, w⁴ ₁₃, w⁵ ₁₃, w⁶ ₁₃, w⁷ ₁₃ and w⁸ ₁₃, basic block matrix 312 ¹ ₁₄ includes weights w¹ ₁₄, w² ₁₄, w³ ₁₄, w⁴ ₁₄, w⁵ ₁₄, w⁶ ₁₄, w⁷ ₁₄ and w⁸ ₁₄, basic block matrix 312 ¹ ₁₅ includes weights w¹ ₁₅, w² ₁₅, w³ ₁₅, w⁴ ₁₅, w⁵ ₁₅, w⁶ ₁₅, w⁷ ₁₅ and w⁸ ₁₅, and basic block matrix 312 ¹ ₁₆ includes weights w¹ ₁₆, w² ₁₆, w³ ₁₆, w⁴ ₁₆, w⁵ ₁₆, w⁶ ₁₆, w⁷ ₁₆ and w⁸ ₁₆.

With respect to basic block matrix set 312 ², basic block matrix 312 ² ₁ includes weights x¹ ₁, x² ₁, x³ ₁, x⁴ ₁, x⁵ ₁, x⁶ ₁, x⁷ ₁ and x⁸ ₁, basic block matrix 312 ² ₂ includes weights x¹ ₂, x² ₂, x³ ₂, x⁴ ₂, x⁵ ₂, x⁶ ₂, x⁷ ₂ and x⁸ ₂, basic block matrix 312 ² ₃ includes weights x¹ ₃, x² ₃, x³ ₃, x⁴ ₃, x⁵ ₃, x⁶ ₃, x⁷ ₃ and x⁸ ₃, basic block matrix 312 ² ₄ includes weights x¹ ₄, x² ₄, x³ ₄, x⁴ ₄, x⁵ ₄, x⁶ ₄, x⁷ ₄ and x⁸ ₄, basic block matrix 312 ² ₅ includes weights x¹ ₅, x² ₅, x³ ₅, x⁴ ₅, x⁵ ₅, x⁶ ₅, x⁷ ₅ and x⁸ ₅, basic block matrix 312 ² ₆ includes weights x¹ ₆, x² ₆, x³ ₆, x⁴ ₆, x⁵ ₆, x⁶ ₆, x⁷ ₆ and x⁸ ₆, basic block matrix 312 ² ₇ includes weights x¹ ₇, x² ₇, x³ ₇, x⁴ ₇, x⁵ ₇, x⁶ ₇, x⁷ ₇ and x⁸ ₇, basic block matrix 312 ² ₈ includes weights x¹ ₈, x² ₈, x³ ₈, x⁴ ₈, x⁵ ₈, x⁶ ₈, x⁷ ₈ and x⁸ ₈, basic block matrix 312 ² ₉ includes weights x¹ ₉, x² ₉, x³ ₉, x⁴ ₉, x⁵ ₉, x⁶ ₉, x⁷ ₉ and x⁸ ₉, basic block matrix 312 ² ₁₀ includes weights x¹ ₁₀, x² ₁₀, x³ ₁₀, x⁴ ₁₀, x⁵ ₁₀, x⁶ ₁₀, x⁷ ₁₀ and x⁸ ₁₀, basic block matrix 312 ² ₁₁ includes weights x¹ ₁₁, x² ₁₁, x³ ₁₁, x⁴ ₁₁, x⁵ ₁₁, x⁶ ₁₁, x⁷ ₁₁ and x⁸ ₁₁, basic block matrix 312 ² ₁₂ includes weights x¹ ₁₂, x² ₁₂, x³ ₁₂, x⁴ ₁₂, x⁵ ₁₂, x⁶ ₁₂, x⁷ ₁₂ and x⁸ ₁₂, basic block matrix 312 ² ₁₃ includes weights x¹ ₁₃, x² ₁₃, x³ ₁₃, x⁴ ₁₃, x⁵ ₁₃, x⁶ ₁₃, x⁷ ₁₃ and x⁸ ₁₃, basic block matrix 312 ² ₁₄ includes weights x¹ ₁₄, x² ₁₄, x³ ₁₄, x⁴ ₁₄, x⁵ ₁₄, x⁶ ₁₄, x⁷ ₁₄ and x⁸ ₁₄, basic block matrix 312 ² ₁₅ includes weights x¹ ₁₅, x² ₁₅, x³ ₁₅, x⁴ ₁₅, x⁵ ₁₅, x⁶ ₁₅, x⁷ ₁₅ and x⁸ ₁₅, and basic block matrix 312 ² ₁₆ includes weights x¹ ₁₆, x² ₁₆, x³ ₁₆, x⁴ ₁₆, x⁵ ₁₆, x⁶ ₁₆, x⁷ ₁₆ and x⁸ ₁₆.

With respect to basic block matrix set 312 ³, basic block matrix 312 ³ ₁ includes weights y¹ ₁, y² ₁, y³ ₁, y⁴ ₁, y⁵ ₁, y⁶ ₁, y⁷ ₁ and y⁸ ₁, basic block matrix 312 ³ ₂ includes weights y¹ ₂, y² ₂, y³ ₂, y⁴ ₂, y⁵ ₂, y⁶ ₂, y⁷ ₂ and y⁸ ₂, basic block matrix 312 ³ ₃ includes weights y¹ ₃, y² ₃, y³ ₃, y⁴ ₃, y⁵ ₃, y⁶ ₃, y⁷ ₃ and y⁸ ₃, basic block matrix 312 ³ ₄ includes weights y¹ ₄, y² ₄, y³ ₄, y⁴ ₄, y⁵ ₄, y⁶ ₄, y⁷ ₄ and y⁸ ₄, basic block matrix 312 ³ ₅ includes weights y¹ ₅, y² ₅, y³ ₅, y⁴ ₅, y⁵ ₅, y⁶ ₅, y⁷ ₅ and y⁸ ₅, basic block matrix 312 ³ ₆ includes weights y¹ ₆, y² ₆, y³ ₆, y⁴ ₆, y⁵ ₆, y⁶ ₆, y⁷ ₆ and y⁸ ₆, basic block matrix 312 ³ ₇ includes weights y¹ ₇, y² ₇, y³ ₇, y⁴ ₇, y⁵ ₇, y⁶ ₇, y⁷ ₇ and y⁸ ₇, basic block matrix 312 ³ ₈ includes weights y¹ ₈, y² ₈, y³ ₈, y⁴ ₈, y⁵ ₈, y⁶ ₈, y⁷ ₈ and y⁸ ₈, basic block matrix 312 ³ ₉ includes weights y¹ ₉, y² ₉, y³ ₉, y⁴ ₉, y⁵ ₉, y⁶ ₉, y⁷ ₉ and y⁸ ₉, basic block matrix 312 ³ ₁₀ includes weights y¹ ₁₀, y² ₁₀, y³ ₁₀, y⁴ ₁₀, y⁵ ₁₀, y⁶ ₁₀, y⁷ ₁₀ and y⁸ ₁₀, basic block matrix 312 ³ ₁₁ includes weights y¹ ₁₁, y² ₁₁, y³ ₁₁, y⁴ ₁₁, y⁵ ₁₁, y⁶ ₁₁, y⁷ ₁₁ and y⁸ ₁₁, basic block matrix 312 ³ ₁₂ includes weights y¹ ₁₂, y² ₁₂, y³ ₁₂, y⁴ ₁₂, y⁵ ₁₂, y⁶ ₁₂, y⁷ ₁₂ and y⁸ ₁₂, basic block matrix 312 ³ ₁₃ includes weights y¹ ₁₃, y² ₁₃, y³ ₁₃, y⁴ ₁₃, y⁵ ₁₃, y⁶ ₁₃, y⁷ ₁₃ and y⁸ ₁₃, basic block matrix 312 ³ ₁₄ includes weights y¹ ₁₄, y² ₁₄, y³ ₁₄, y⁴ ₁₄, y⁵ ₁₄, y⁶ ₁₄, y⁷ ₁₄ and y⁸ ₁₄, basic block matrix 312 ³ ₁₅ includes weights y¹ ₁₅, y² ₁₅, y³ ₁₅, y⁴ ₁₅, y⁵ ₁₅, y⁶ ₁₅, y⁷ ₁₅ and y⁸ ₁₅, and basic block matrix 312 ³ ₁₆ includes weights y¹ ₁₆, y² ₁₆, y³ ₁₆, y⁴ ₁₆, y⁵ ₁₆, y⁶ ₁₆, y⁷ ₁₆ and y⁸ ₁₆.

Because weight tensors 202 ¹, 202 ² and 202 ³ have been reformed into basic block matrix sets 312 ¹, 312 ² and 312 ³ across the channel dimension rather than the height and width dimensions, encoding the weights of each basic block matrix advantageously avoids adverse effects on the local data within any particular weight matrix.

FIG. 7A depicts matrix element encoding process 350, according to an embodiment of the present disclosure.

In this embodiment, the matrix element is an 8-bit unsigned integer weight 322. The principles discussed below are applicable to other types of matrix elements, such as, for example, 8-bit unsigned integer activations, 8-bit signed integer weights or activations, 16-bit signed or unsigned integer weights or activations, 32-bit signed or unsigned integer weights or activations, certain floating point formats, etc.

In this embodiment, the encoding process for the 8-bit unsigned integer matrix element advantageously includes both quantization, to reduce the number of bits, and pruning to remove small values at or close to zero. Weight 322 may be encoded as a 4-bit “zero magnitude” unsigned integer value, i.e., encoded weight 322 ¹, a 4-bit “small magnitude” unsigned integer value, i.e., encoded weight 322 ², a 4-bit “large magnitude” unsigned integer value, i.e., encoded weight 322 ³, or an 8-bit “full magnitude” unsigned integer value, i.e., encoded weight 322 ⁴. In another embodiment, the encoding process for the 8-bit signed integer matrix element uses 4-bit “zero magnitude”, “small magnitude” and “large magnitude” signed integer values and an 8-bit “full magnitude” signed integer value.

The 4-bit “large magnitude” unsigned integer value is similar to an 8-bit unsigned integer value with a zero-valued least significant bit (LSB) nibble, while the 4-bit “small magnitude” unsigned integer value is similar to an 8-bit unsigned integer value with a zero-valued most significant bit (MSB) nibble. The 4-bit “large magnitude” unsigned integer value is created by shifting the matrix element 4 bits to the right, and the matrix element is reconstructed by shifting the 4-bit “large magnitude” unsigned integer value 4 bits to the left. The information contained in the lower 4 bits (i.e., the LSB nibble) of the matrix element is lost during the encoding and recreation processes.
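The shift-based quantization described above can be illustrated with a minimal sketch. The function names `encode_large` and `decode_large` and the sample value 0xB7 are hypothetical and chosen only for illustration; the shift amounts follow the description.

```python
def encode_large(value: int) -> int:
    """Keep the MSB nibble of an 8-bit unsigned value (shift right by 4 bits)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("expected an 8-bit unsigned value")
    return value >> 4


def decode_large(nibble: int) -> int:
    """Reconstruct an 8-bit value from a 4-bit nibble (shift left by 4 bits)."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("expected a 4-bit unsigned value")
    return nibble << 4


original = 0xB7                   # hypothetical 8-bit weight, 183 decimal
nibble = encode_large(original)   # MSB nibble only
restored = decode_large(nibble)   # LSB nibble is reconstructed as zero
error = original - restored       # information lost with the LSB nibble
```

In this sketch the reconstruction error is always the value of the discarded LSB nibble, so it lies between 0 and 15.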

Additional encoding types are also contemplated, such as, for example, an 8-bit “medium magnitude” integer for a 16-bit integer matrix element, an 8-bit “small magnitude” integer for a 32-bit integer matrix element, etc.

Each type of encoding has an associated 2-bit index 324, represented as “i:1, i:0”, that identifies or describes the encoding type. Weight 322 ¹ has an associated 2-bit index 324 ¹ to identify the “zero magnitude” type, and has a bit pattern of “00”. Weight 322 ² has an associated 2-bit index 324 ² to identify the “small magnitude” type, and has a bit pattern of “10”. Weight 322 ³ has an associated 2-bit index 324 ³ to identify the “large magnitude” type, and has a bit pattern of “11”. Weight 322 ⁴ has an associated 2-bit index 324 ⁴ to identify the “full magnitude” type, and has a bit pattern of “01”.

Other index values and numbers of bits may also be used, such as, for example, a 3-bit index with one value (e.g., “100”) indicating an 8-bit “medium magnitude” integer for a 16-bit integer matrix element, a 3-bit index with one value (e.g., “110”) indicating an 8-bit “small magnitude” integer for a 32-bit integer matrix element, etc.

Weight 322 includes eight bits represented as w:0, w:1, w:2, w:3, w:4, w:5, w:6, w:7 (LSB to MSB). Generally, weight 322 has a value between 0 and 255 (decimal). When the value of weight 322 is zero or less than a lower threshold value, such as, for example, 1, 2, etc., then weight 322 may be encoded as a 0-bit “zero magnitude” value, i.e., weight 322 ¹, with an associated 2-bit index 324 ¹ (i.e., “00”). When the value of weight 322 is greater than the lower threshold value but less than or equal to an upper threshold value, such as, for example, 15, then weight 322 may be encoded as a “small magnitude” unsigned integer value, i.e., weight 322 ², with an associated 2-bit index 324 ² (i.e., “10”). When the value of weight 322 is greater than the upper threshold value, then weight 322 may be encoded as a “large magnitude” unsigned integer value, i.e., weight 322 ³, with an associated 2-bit index 324 ³ (i.e., “11”). In one embodiment, when the value of weight 322 is greater than the upper threshold value and the LSB nibble is desired to be retained (for accuracy, etc.), then weight 322 may be encoded as a “full magnitude” unsigned integer value, i.e., weight 322 ⁴, with an associated 2-bit index 324 ⁴ (i.e., “01”).
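The threshold logic above may be sketched as follows. This is an illustrative sketch only: the function name `encode_weight`, the `keep_full` flag, and the default thresholds (1 and 15, taken from the examples in the text) are assumptions, as is the treatment of a value exactly equal to the lower threshold, which is encoded here as “small magnitude”.

```python
from typing import Optional, Tuple

# 2-bit index patterns, read as "i:1, i:0"
IDX_ZERO, IDX_FULL, IDX_SMALL, IDX_LARGE = 0b00, 0b01, 0b10, 0b11


def encode_weight(value: int, lower: int = 1, upper: int = 15,
                  keep_full: bool = False) -> Tuple[int, Optional[int]]:
    """Return (index, payload) for one 8-bit unsigned weight.

    payload is None for "zero magnitude", a 4-bit nibble for "small" or
    "large magnitude", and the unchanged 8-bit value for "full magnitude".
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("expected an 8-bit unsigned value")
    if upper > 0xF:
        raise ValueError("small magnitude values must fit in 4 bits")
    if value < lower:            # pruned: nothing is stored in the data field
        return IDX_ZERO, None
    if value <= upper:           # fits in the LSB nibble without loss
        return IDX_SMALL, value
    if keep_full:                # retain the LSB nibble for accuracy
        return IDX_FULL, value
    return IDX_LARGE, value >> 4  # keep the MSB nibble only
```

For example, under these assumptions a weight of 200 yields index “11” with nibble 12, or index “01” with the full value 200 when `keep_full` is set.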

In another embodiment, the matrix element encoding process may be hierarchical, which advantageously reduces the index overhead when the data has high sparsity, i.e., when most of the matrix elements are zero. In this embodiment, each matrix element has an associated index with one or two bits. The first bit indicates whether the matrix element has a zero value or a non-zero value. If the matrix element has a non-zero value, then the second bit indicates whether the matrix element has been encoded as a “small magnitude” value or a “large magnitude” value.
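A minimal sketch of the hierarchical index follows. The description specifies only that the first bit distinguishes zero from non-zero and the second bit distinguishes “small magnitude” from “large magnitude”; the concrete bit values used here (“0”, “10”, “11”), the string representation, and the function name `hierarchical_index_bits` are assumptions made for illustration.

```python
from typing import Iterable


def hierarchical_index_bits(kinds: Iterable[str]) -> str:
    """Concatenate a 1-bit or 2-bit index for each matrix element."""
    codes = {
        "zero": "0",    # one bit: element is zero
        "small": "10",  # non-zero, encoded as "small magnitude"
        "large": "11",  # non-zero, encoded as "large magnitude"
    }
    bits = []
    for kind in kinds:
        if kind not in codes:
            raise ValueError(f"unknown encoding type: {kind}")
        bits.append(codes[kind])
    return "".join(bits)


# A block with four non-zero and four zero elements (50% sparsity)
block_kinds = ["large", "small", "small", "small",
               "zero", "zero", "zero", "zero"]
index_bits = hierarchical_index_bits(block_kinds)
overhead = len(index_bits)  # compare with 16 bits for eight 2-bit indices
```

With this assumed code assignment, the index overhead for the example block is 12 bits rather than 16, and the saving grows as more elements are zero.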

FIG. 7B depicts basic block encoding process 360, according to an embodiment of the present disclosure.

Generally, each basic block matrix 312 ^(i) _(j) is encoded into a fixed-size encoded block 340 ^(i) _(j) that includes a fixed-size data field 332 ^(i) _(j) and a fixed-size index field 334 ^(i) _(j). The size of data field 332 ^(i) _(j) is based on the amount of compression desired and the sparsity of the data within basic block matrices 312 ^(i) _(j), while the size of index field 334 ^(i) _(j) depends upon the number of elements within each basic block matrix 312 ^(i) _(j).

In one embodiment, the sparsity of the data is low and each basic block matrix 312 ^(i) _(j) includes eight, 8-bit elements with a total size of 64 bits (8B). The size of index field 334 ^(i) _(j) will be 16 bits (2B), and the size of data field 332 ^(i) _(j) depends on the amount of compression desired. For example, if a data compression ratio of about 0.5 is desired, then the size of data field 332 ^(i) _(j) may be set to 32 bits (4B), which accommodates a number of different encoding combinations, such as four “full magnitude” elements and four “zero magnitude” elements (4·8 bits or 4B), eight “small magnitude” elements (8·4 bits or 4B), eight “large magnitude” elements (8·4 bits or 4B), four “small magnitude” elements and four “large magnitude” elements (4·4+4·4 bits or 4B), two “full magnitude” elements, four “small magnitude” elements and two “zero magnitude” elements (2·8+4·4 bits or 4B), etc. In this example, the overall compression ratio is 0.75 due to the overhead incurred by index field 334 ^(i) _(j) (i.e., (4B+2B)/8B).

In another embodiment, the sparsity of the data is 50%, and each basic block matrix 312 ^(i) _(j) includes eight, 8-bit elements with a total size of 64 bits (8B). The size of index field 334 ^(i) _(j) will be 16 bits (2B), and the size of data field 332 ^(i) _(j) depends on the amount of compression. In this embodiment, four elements have zero values and four elements have non-zero values. If a data compression ratio of about 0.5 is desired, then the size of data field 332 ^(i) _(j) may be set to 32 bits (4B), which accommodates four “full magnitude” elements, as described above, without any loss of accuracy. However, the overall compression ratio is still 0.75 due to the overhead incurred by index field 334 ^(i) _(j) (i.e., (4B+2B)/8B).

Advantageously, if an overall compression ratio of 0.5 is desired, then the size of data field 332 ^(i) _(j) may be set to 16 bits (2B) and each non-zero element may be encoded as a “small magnitude” element or a “large magnitude” element. The overall compression ratio is now 0.5 (i.e., (2B+2B)/8B) or 2:1.
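The field sizes and compression ratios discussed in the preceding embodiments can be checked with simple arithmetic. The sketch below is illustrative; the helper names `data_bits` and `overall_ratio` are hypothetical, and “zero magnitude” elements are counted as occupying no bits in the data field.

```python
# Bits occupied in the data field by each encoding type
BITS = {"zero": 0, "small": 4, "large": 4, "full": 8}


def data_bits(counts: dict) -> int:
    """Total data field bits for a mix of encoding types in one block."""
    return sum(BITS[kind] * n for kind, n in counts.items())


def overall_ratio(data_bytes: int, index_bytes: int, block_bytes: int = 8) -> float:
    """Overall compression ratio: (data field + index field) / original block."""
    return (data_bytes + index_bytes) / block_bytes


# Combinations that fill a 32-bit (4B) data field
combo_a = data_bits({"full": 4, "zero": 4})
combo_b = data_bits({"small": 4, "large": 4})
combo_c = data_bits({"full": 2, "small": 4, "zero": 2})

ratio_4b = overall_ratio(4, 2)  # 4B data field plus 2B index field
ratio_2b = overall_ratio(2, 2)  # 2B data field plus 2B index field
```

The two ratios evaluate to 0.75 and 0.5 respectively, matching the values stated above once the index field overhead is included in the numerator.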

In other embodiments, the size of data field 332 ^(i) _(j) may be set to 16 bits (2B) and the four “highest-valued” non-zero elements may be encoded as “small magnitude” or “large magnitude” elements, while the remaining elements are encoded as “zero magnitude” elements. In embodiments with very sparse data, fewer than four elements may have non-zero values, in which case one or more elements may be encoded as a “small magnitude” element with a zero value and the proper index (i.e., index 324 ²) to ensure that four elements are encoded as “small magnitude” or “large magnitude” elements. While this accommodation introduces multiply-by-zero situations during processing, the advantages associated with reducing the memory footprint outweigh the disadvantages of the occasional multiply-by-zero situation.
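The selection of elements that occupy the four data field slots may be sketched as follows. The function name `choose_kept_positions`, the tie-breaking by position, and the choice of the first zero-valued positions for padding are assumptions; the description does not specify them.

```python
from typing import List


def choose_kept_positions(block: List[int], slots: int = 4) -> List[int]:
    """Return the positions (0-based) whose elements occupy the data field."""
    nonzero = [i for i, v in enumerate(block) if v != 0]
    # Keep the highest-valued non-zero elements; ties go to the lower position.
    kept = sorted(nonzero, key=lambda i: (-block[i], i))[:slots]
    # Very sparse block: pad with zero-valued elements, which would then be
    # stored as "small magnitude" elements with a zero value.
    if len(kept) < slots:
        zeros = [i for i, v in enumerate(block) if v == 0]
        kept += zeros[:slots - len(kept)]
    return sorted(kept)


dense_block = [5, 0, 200, 7, 9, 0, 33, 1]   # six non-zero values: two are pruned
sparse_block = [0, 0, 12, 0, 0, 0, 0, 3]    # two non-zero values: two slots padded
kept_dense = choose_kept_positions(dense_block)
kept_sparse = choose_kept_positions(sparse_block)
```

In the first hypothetical block, the values 5 and 1 are dropped to “zero magnitude”; in the second, two zero-valued positions are kept so that the data field remains a fixed size.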

As illustrated in basic block encoding process 360, basic block matrix 312 ¹ ₁ may be encoded into a fixed-size encoded block 340 ¹ ₁ that includes a fixed-size data field 332 ¹ ₁ (2B) and a fixed-size index field 334 ¹ ₁ (2B). Each weight is encoded as a “zero magnitude” element (i.e., weight 322 ¹), a “small magnitude” element (i.e., weight 322 ²), or a “large magnitude” element (i.e., weight 322 ³), and the associated index 324 ¹, 324 ², or 324 ³ is generated. Importantly, while all of the associated indices are added to index field 334 ^(i) _(j), only those weights that are encoded as “small magnitude” or “large magnitude” elements are added to data field 332 ^(i) _(j). In other words, “zero magnitude” elements are not present within data field 332 ^(i) _(j) unless the data is very sparse, as discussed above.

More particularly, weight w¹ ₁ is encoded into weight 322 ¹, weight 322 ² or weight 322 ³ and added to data field 332 ¹ ₁ as appropriate, and the associated index is added to index 334 ¹ ₁ at bits i:0, i:1. Weight w² ₁ is encoded into weight 322 ¹, weight 322 ² or weight 322 ³ and added to data field 332 ¹ ₁ as appropriate, and the associated index is added to index 334 ¹ ₁ at bits i:2, i:3. Weight w³ ₁ is encoded into weight 322 ¹, weight 322 ² or weight 322 ³ and added to data field 332 ¹ ₁ as appropriate, and the associated index is added to index 334 ¹ ₁ at bits i:4, i:5. Weight w⁴ ₁ is encoded into weight 322 ¹, weight 322 ² or weight 322 ³ and added to data field 332 ¹ ₁ as appropriate, and the associated index is added to index 334 ¹ ₁ at bits i:6, i:7. Weight w⁵ ₁ is encoded into weight 322 ¹, weight 322 ² or weight 322 ³ and added to data field 332 ¹ ₁ as appropriate, and the associated index is added to index 334 ¹ ₁ at bits i:8, i:9. Weight w⁶ ₁ is encoded into weight 322 ¹, weight 322 ² or weight 322 ³ and added to data field 332 ¹ ₁ as appropriate, and the associated index is added to index 334 ¹ ₁ at bits i:10, i:11. Weight w⁷ ₁ is encoded into weight 322 ¹, weight 322 ² or weight 322 ³ and added to data field 332 ¹ ₁ as appropriate, and the associated index is added to index 334 ¹ ₁ at bits i:12, i:13. Weight w⁸ ₁ is encoded into weight 322 ¹, weight 322 ² or weight 322 ³ and added to data field 332 ¹ ₁ as appropriate, and the associated index is added to index 334 ¹ ₁ at bits i:14, i:15.
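The bit placement described above can be sketched for a single basic block matrix. The function name `encode_block` and the sample weight values are hypothetical. The index placement (the index for the k-th weight at bits i:2(k−1), i:2(k−1)+1) follows the description, while the order in which nibbles are packed into the data field (first stored weight in the least significant nibble) is an assumption. Padding for very sparse blocks is omitted for brevity.

```python
from typing import List, Tuple


def encode_block(block: List[int], upper: int = 15) -> Tuple[int, int]:
    """Encode eight 8-bit weights into a (data_field, index_field) pair of 16-bit values."""
    if len(block) != 8:
        raise ValueError("a basic block matrix has eight elements")
    index_field = 0
    nibbles = []
    for k, value in enumerate(block):
        if value == 0:
            idx = 0b00                    # "zero magnitude": index only
        elif value <= upper:
            idx = 0b10                    # "small magnitude": LSB nibble
            nibbles.append(value)
        else:
            idx = 0b11                    # "large magnitude": MSB nibble
            nibbles.append(value >> 4)
        index_field |= idx << (2 * k)     # weight k+1 occupies bits 2k and 2k+1
    if len(nibbles) > 4:
        raise ValueError("more than four non-zero elements for a 2B data field")
    data_field = 0
    for slot, nibble in enumerate(nibbles):
        data_field |= nibble << (4 * slot)  # assumed packing order
    return data_field, index_field


# Hypothetical weights: one large value, three small values, four zeros
data_field, index_field = encode_block([0xC0, 3, 5, 7, 0, 0, 0, 0])
```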

The result of basic block encoding process 360 is a transformation of a basic block matrix set 312 into an encoded block set 340, such as, for example, basic block matrix set 312 ¹ into encoded block set 340 ¹.

FIGS. 8A, 8B, 8C and 8D depict basic block encoding process 360 for basic block matrix set 312 ¹, according to an embodiment of the present invention.

In this embodiment, basic block matrix set 312 ¹ has a sparsity of 50%, which is randomly distributed throughout basic block matrices 312 ¹ _(j).

As depicted in FIG. 8A, basic block matrix 312 ¹ ₁ includes four weights with non-zero values (i.e., weights w¹ ₁, w² ₁, w³ ₁ and w⁴ ₁) and four weights with zero values (i.e., weights w⁵ ₁, w⁶ ₁, w⁷ ₁ and w⁸ ₁). Basic block matrix 312 ¹ ₂ includes four weights with non-zero values (i.e., weights w² ₂, w³ ₂, w⁴ ₂ and w⁵ ₂) and four weights with zero values (i.e., weights w¹ ₂, w⁶ ₂, w⁷ ₂ and w⁸ ₂). Basic block matrix 312 ¹ ₃ includes four weights with non-zero values (i.e., weights w³ ₃, w⁴ ₃, w⁵ ₃ and w⁶ ₃) and four weights with zero values (i.e., weights w¹ ₃, w² ₃, w⁷ ₃ and w⁸ ₃). Basic block matrix 312 ¹ ₄ includes four weights with non-zero values (i.e., weights w⁴ ₄, w⁵ ₄, w⁶ ₄ and w⁷ ₄) and four weights with zero values (i.e., weights w¹ ₄, w² ₄, w³ ₄ and w⁸ ₄). Basic block matrix 312 ¹ ₅ includes four weights with non-zero values (i.e., weights w⁵ ₅, w⁶ ₅, w⁷ ₅ and w⁸ ₅) and four weights with zero values (i.e., weights w¹ ₅, w² ₅, w³ ₅ and w⁴ ₅). Basic block matrix 312 ¹ ₆ includes four weights with non-zero values (i.e., weights w¹ ₆, w³ ₆, w⁵ ₆ and w⁷ ₆) and four weights with zero values (i.e., weights w² ₆, w⁴ ₆, w⁶ ₆ and w⁸ ₆). Basic block matrix 312 ¹ ₇ includes four weights with non-zero values (i.e., weights w² ₇, w⁴ ₇, w⁶ ₇ and w⁸ ₇) and four weights with zero values (i.e., weights w¹ ₇, w³ ₇, w⁵ ₇ and w⁷ ₇). Basic block matrix 312 ¹ ₈ includes four weights with non-zero values (i.e., weights w¹ ₈, w² ₈, w⁵ ₈ and w⁶ ₈) and four weights with zero values (i.e., weights w³ ₈, w⁴ ₈, w⁷ ₈ and w⁸ ₈).

Basic block matrix 312 ¹ ₉ includes four weights with non-zero values (i.e., weights w³ ₉, w⁴ ₉, w⁷ ₉ and w⁸ ₉) and four weights with zero values (i.e., weights w¹ ₉, w² ₉, w⁵ ₉ and w⁶ ₉). Basic block matrix 312 ¹ ₁₀ includes four weights with non-zero values (i.e., weights w¹ ₁₀, w² ₁₀, w³ ₁₀ and w⁵ ₁₀) and four weights with zero values (i.e., weights w⁴ ₁₀, w⁶ ₁₀, w⁷ ₁₀ and w⁸ ₁₀). Basic block matrix 312 ¹ ₁₁ includes four weights with non-zero values (i.e., weights w² ₁₁, w³ ₁₁, w⁴ ₁₁ and w⁶ ₁₁) and four weights with zero values (i.e., weights w¹ ₁₁, w⁵ ₁₁, w⁷ ₁₁ and w⁸ ₁₁). Basic block matrix 312 ¹ ₁₂ includes four weights with non-zero values (i.e., weights w³ ₁₂, w⁴ ₁₂, w⁵ ₁₂ and w⁷ ₁₂) and four weights with zero values (i.e., weights w¹ ₁₂, w² ₁₂, w⁶ ₁₂ and w⁸ ₁₂). Basic block matrix 312 ¹ ₁₃ includes four weights with non-zero values (i.e., weights w⁴ ₁₃, w⁵ ₁₃, w⁶ ₁₃ and w⁸ ₁₃) and four weights with zero values (i.e., weights w¹ ₁₃, w² ₁₃, w³ ₁₃ and w⁷ ₁₃). Basic block matrix 312 ¹ ₁₄ includes four weights with non-zero values (i.e., weights w¹ ₁₄, w⁵ ₁₄, w⁶ ₁₄ and w⁷ ₁₄) and four weights with zero values (i.e., weights w² ₁₄, w³ ₁₄, w⁴ ₁₄ and w⁸ ₁₄). Basic block matrix 312 ¹ ₁₅ includes four weights with non-zero values (i.e., weights w² ₁₅, w⁶ ₁₅, w⁷ ₁₅ and w⁸ ₁₅) and four weights with zero values (i.e., weights w¹ ₁₅, w³ ₁₅, w⁴ ₁₅ and w⁵ ₁₅). Basic block matrix 312 ¹ ₁₆ includes four weights with non-zero values (i.e., weights w¹ ₁₆, w³ ₁₆, w⁷ ₁₆ and w⁸ ₁₆) and four weights with zero values (i.e., weights w² ₁₆, w⁴ ₁₆, w⁵ ₁₆ and w⁶ ₁₆).

In this embodiment, each 64-bit (8B) basic block matrix 312 ¹ _(j) is encoded into a 16-bit (2B) data field 332 ¹ _(j), and each non-zero element is encoded as a 4-bit “small magnitude” element (i.e., weight 322 ²) or a 4-bit “large magnitude” element (i.e., weight 322 ³). The overall compression ratio for this embodiment is 0.5 (i.e., (2B+2B)/8B) or 2:1.

As depicted in FIG. 8B, weights w¹ ₁, w² ₂, w⁴ ₄, w⁶ ₇, w⁷ ₉, w⁸ ₉, w⁵ ₁₃ and w³ ₁₆ are encoded as 4-bit “large magnitude” elements (i.e., weight 322 ³, shaded background), while weights w² ₁, w³ ₁, w⁴ ₁, w³ ₂, w⁴ ₂, w⁵ ₂, w³ ₃, w⁴ ₃, w⁵ ₃, w⁶ ₃, w⁵ ₄, w⁶ ₄, w⁷ ₄, w⁵ ₅, w⁶ ₅, w⁷ ₅, w⁸ ₅, w¹ ₆, w³ ₆, w⁵ ₆, w⁷ ₆, w² ₇, w⁴ ₇, w⁸ ₇, w¹ ₈, w² ₈, w⁵ ₈, w⁶ ₈, w³ ₉, w⁴ ₉, w¹ ₁₀, w² ₁₀, w³ ₁₀, w⁵ ₁₀, w² ₁₁, w³ ₁₁, w⁴ ₁₁, w⁶ ₁₁, w³ ₁₂, w⁴ ₁₂, w⁵ ₁₂, w⁷ ₁₂, w⁴ ₁₃, w⁶ ₁₃, w⁸ ₁₃, w¹ ₁₄, w⁵ ₁₄, w⁶ ₁₄, w⁷ ₁₄, w² ₁₅, w⁶ ₁₅, w⁷ ₁₅, w⁸ ₁₅, w¹ ₁₆, w⁷ ₁₆ and w⁸ ₁₆ are encoded as 4-bit “small magnitude” elements (i.e., weight 322 ², white background). The weights with zero values are encoded as “zero magnitude” unsigned integer values (not depicted for clarity).

As depicted in FIG. 8C, the non-zero encoded weights of each basic block matrix 312 ¹ _(j) have been formed into respective data fields 332 ¹ _(j), and the associated index fields 334 ¹ _(j) have been created. Each index field 334 ¹ _(j) includes eight, 2-bit index values associated with the eight weights of the respective basic block matrix 312 ¹ _(j), i.e., i¹, i², i³, i⁴, i⁵, i⁶, i⁷ and i⁸. Index i¹ is located in the LSB position, while index i⁸ is located in the MSB position.

Data field 332 ¹ ₁ includes weights w¹ ₁, w² ₁, w³ ₁ and w⁴ ₁ with non-zero values (weights w⁵ ₁, w⁶ ₁, w⁷ ₁ and w⁸ ₁ have zero values). Weight w¹ ₁ has an associated index i¹ of “11” within index field 334 ¹ ₁, weight w² ₁ has an associated index i² of “10” within index field 334 ¹ ₁, weight w³ ₁ has an associated index i³ of “10” within index field 334 ¹ ₁, and weight w⁴ ₁ has an associated index i⁴ of “10” within index field 334 ¹ ₁. Weights w⁵ ₁, w⁶ ₁, w⁷ ₁ and w⁸ ₁ have associated indices i⁵, i⁶, i⁷ and i⁸ (respectively) of “00” within index field 334 ¹ ₁. The hexadecimal value for index field 334 ¹ ₁ is 0x00AB.

Data field 332 ¹ ₂ includes weights w² ₂, w³ ₂, w⁴ ₂ and w⁵ ₂ (weights w¹ ₂, w⁶ ₂, w⁷ ₂ and w⁸ ₂ have zero values). Weight w² ₂ has an associated index i² of “11” within index field 334 ¹ ₂, weight w³ ₂ has an associated index i³ of “10” within index field 334 ¹ ₂, weight w⁴ ₂ has an associated index i⁴ of “10” within index field 334 ¹ ₂, and weight w⁵ ₂ has an associated index i⁵ of “10” within index field 334 ¹ ₂. Weights w¹ ₂, w⁶ ₂, w⁷ ₂ and w⁸ ₂ have associated indices i¹, i⁶, i⁷ and i⁸ (respectively) of “00” within index field 334 ¹ ₂. The hexadecimal value for index field 334 ¹ ₂ is 0x02AC.

Data field 332 ¹ ₃ includes weights w³ ₃, w⁴ ₃, w⁵ ₃ and w⁶ ₃ (weights w¹ ₃, w² ₃, w⁷ ₃ and w⁸ ₃ have zero values). Weight w³ ₃ has an associated index i³ of “10” within index field 334 ¹ ₃, weight w⁴ ₃ has an associated index i⁴ of “10” within index field 334 ¹ ₃, weight w⁵ ₃ has an associated index i⁵ of “10” within index field 334 ¹ ₃, and weight w⁶ ₃ has an associated index i⁶ of “10” within index field 334 ¹ ₃. Weights w¹ ₃, w² ₃, w⁷ ₃ and w⁸ ₃ have associated indices i¹, i², i⁷ and i⁸ (respectively) of “00” within index field 334 ¹ ₃. The hexadecimal value for index field 334 ¹ ₃ is 0x0AA0.

Data field 332 ¹ ₄ includes weights w⁴ ₄, w⁵ ₄, w⁶ ₄ and w⁷ ₄ (weights w¹ ₄, w² ₄, w³ ₄ and w⁸ ₄ have zero values). Weight w⁴ ₄ has an associated index i⁴ of “11” within index field 334 ¹ ₄, weight w⁵ ₄ has an associated index i⁵ of “10” within index field 334 ¹ ₄, weight w⁶ ₄ has an associated index i⁶ of “10” within index field 334 ¹ ₄, and weight w⁷ ₄ has an associated index i⁷ of “10” within index field 334 ¹ ₄. Weights w¹ ₄, w² ₄, w³ ₄ and w⁸ ₄ have associated indices i¹, i², i³ and i⁸ (respectively) of “00” within index field 334 ¹ ₄. The hexadecimal value for index field 334 ¹ ₄ is 0x2AC0.

Data field 332 ¹ ₅ includes weights w⁵ ₅, w⁶ ₅, w⁷ ₅ and w⁸ ₅ (weights w¹ ₅, w² ₅, w³ ₅ and w⁴ ₅ have zero values). Weight w⁵ ₅ has an associated index i⁵ of “10” within index field 334 ¹ ₅, weight w⁶ ₅ has an associated index i⁶ of “10” within index field 334 ¹ ₅, weight w⁷ ₅ has an associated index i⁷ of “10” within index field 334 ¹ ₅, and weight w⁸ ₅ has an associated index i⁸ of “10” within index field 334 ¹ ₅. Weights w¹ ₅, w² ₅, w³ ₅ and w⁴ ₅ have associated indices i¹, i², i³ and i⁴ (respectively) of “00” within index field 334 ¹ ₅. The hexadecimal value for index field 334 ¹ ₅ is 0xAA00.

Data field 332 ¹ ₆ includes weights w¹ ₆, w³ ₆, w⁵ ₆ and w⁷ ₆ (weights w² ₆, w⁴ ₆, w⁶ ₆ and w⁸ ₆ have zero values). Weight w¹ ₆ has an associated index i¹ of “10” within index field 334 ¹ ₆, weight w³ ₆ has an associated index i³ of “10” within index field 334 ¹ ₆, weight w⁵ ₆ has an associated index i⁵ of “10” within index field 334 ¹ ₆, and weight w⁷ ₆ has an associated index i⁷ of “10” within index field 334 ¹ ₆. Weights w² ₆, w⁴ ₆, w⁶ ₆ and w⁸ ₆ have associated indices i², i⁴, i⁶ and i⁸ (respectively) of “00” within index field 334 ¹ ₆. The hexadecimal value for index field 334 ¹ ₆ is 0x2222.

Data field 332 ¹ ₇ includes weights w² ₇, w⁴ ₇, w⁶ ₇ and w⁸ ₇ (weights w¹ ₇, w³ ₇, w⁵ ₇ and w⁷ ₇ have zero values). Weight w² ₇ has an associated index i² of “10” within index field 334 ¹ ₇, weight w⁴ ₇ has an associated index i⁴ of “10” within index field 334 ¹ ₇, weight w⁶ ₇ has an associated index i⁶ of “11” within index field 334 ¹ ₇, and weight w⁸ ₇ has an associated index i⁸ of “10” within index field 334 ¹ ₇. Weights w¹ ₇, w³ ₇, w⁵ ₇ and w⁷ ₇ have associated indices i¹, i³, i⁵ and i⁷ (respectively) of “00” within index field 334 ¹ ₇. The hexadecimal value for index field 334 ¹ ₇ is 0x8C88.

Data field 332 ¹ ₈ includes weights w¹ ₈, w² ₈, w⁵ ₈ and w⁶ ₈ (weights w³ ₈, w⁴ ₈, w⁷ ₈ and w⁸ ₈ have zero values). Weight w¹ ₈ has an associated index i¹ of “10” within index field 334 ¹ ₈, weight w² ₈ has an associated index i² of “10” within index field 334 ¹ ₈, weight w⁵ ₈ has an associated index i⁵ of “10” within index field 334 ¹ ₈, and weight w⁶ ₈ has an associated index i⁶ of “10” within index field 334 ¹ ₈. Weights w³ ₈, w⁴ ₈, w⁷ ₈ and w⁸ ₈ have associated indices i³, i⁴, i⁷ and i⁸ (respectively) of “00” within index field 334 ¹ ₈. The hexadecimal value for index field 334 ¹ ₈ is 0x0A0A.

Data field 332 ¹ ₉ includes weights w³ ₉, w⁴ ₉, w⁷ ₉ and w⁸ ₉ with non-zero values (weights w¹ ₉, w² ₉, w⁵ ₉ and w⁶ ₉ have zero values). Weight w³ ₉ has an associated index i³ of “10” within index field 334 ¹ ₉, weight w⁴ ₉ has an associated index i⁴ of “10” within index field 334 ¹ ₉, weight w⁷ ₉ has an associated index i⁷ of “11” within index field 334 ¹ ₉, and weight w⁸ ₉ has an associated index i⁸ of “11” within index field 334 ¹ ₉. Weights w¹ ₉, w² ₉, w⁵ ₉ and w⁶ ₉ have associated indices i¹, i², i⁵ and i⁶ (respectively) of “00” within index field 334 ¹ ₉. The hexadecimal value for index field 334 ¹ ₉ is 0xF0A0.

Data field 332 ¹ ₁₀ includes weights w¹ ₁₀, w² ₁₀, w³ ₁₀ and w⁵ ₁₀ with non-zero values (weights w⁴ ₁₀, w⁶ ₁₀, w⁷ ₁₀ and w⁸ ₁₀ have zero values). Weight w¹ ₁₀ has an associated index i¹ of “10” within index field 334 ¹ ₁₀, weight w² ₁₀ has an associated index i² of “10” within index field 334 ¹ ₁₀, weight w³ ₁₀ has an associated index i³ of “10” within index field 334 ¹ ₁₀, and weight w⁵ ₁₀ has an associated index i⁵ of “10” within index field 334 ¹ ₁₀. Weights w⁴ ₁₀, w⁶ ₁₀, w⁷ ₁₀ and w⁸ ₁₀ have associated indices i⁴, i⁶, i⁷ and i⁸ (respectively) of “00” within index field 334 ¹ ₁₀. The hexadecimal value for index field 334 ¹ ₁₀ is 0x022A.

Data field 332 ¹ ₁₁ includes weights w² ₁₁, w³ ₁₁, w⁴ ₁₁ and w⁶ ₁₁ with non-zero values (weights w¹ ₁₁, w⁵ ₁₁, w⁷ ₁₁ and w⁸ ₁₁ have zero values). Weight w² ₁₁ has an associated index i² of “10” within index field 334 ¹ ₁₁, weight w³ ₁₁ has an associated index i³ of “10” within index field 334 ¹ ₁₁, weight w⁴ ₁₁ has an associated index i⁴ of “10” within index field 334 ¹ ₁₁, and weight w⁶ ₁₁ has an associated index i⁶ of “10” within index field 334 ¹ ₁₁. Weights w¹ ₁₁, w⁵ ₁₁, w⁷ ₁₁ and w⁸ ₁₁ have associated indices i¹, i⁵, i⁷ and i⁸ (respectively) of “00” within index field 334 ¹ ₁₁. The hexadecimal value for index field 334 ¹ ₁₁ is 0x08A8.

Data field 332 ¹ ₁₂ includes weights w³ ₁₂, w⁴ ₁₂, w⁵ ₁₂ and w⁷ ₁₂ with non-zero values (weights w¹ ₁₂, w² ₁₂, w⁶ ₁₂ and w⁸ ₁₂ have zero values). Weight w³ ₁₂ has an associated index i³ of “10” within index field 334 ¹ ₁₂, weight w⁴ ₁₂ has an associated index i⁴ of “10” within index field 334 ¹ ₁₂, weight w⁵ ₁₂ has an associated index i⁵ of “10” within index field 334 ¹ ₁₂, and weight w⁷ ₁₂ has an associated index i⁷ of “10” within index field 334 ¹ ₁₂. Weights w¹ ₁₂, w² ₁₂, w⁶ ₁₂ and w⁸ ₁₂ have associated indices i¹, i², i⁶ and i⁸ (respectively) of “00” within index field 334 ¹ ₁₂. The hexadecimal value for index field 334 ¹ ₁₂ is 0x22A0.

Data field 332 ¹ ₁₃ includes weights w⁴ ₁₃, w⁵ ₁₃, w⁶ ₁₃ and w⁸ ₁₃ with non-zero values (weights w¹ ₁₃, w² ₁₃, w³ ₁₃ and w⁷ ₁₃ have zero values). Weight w⁴ ₁₃ has an associated index i⁴ of “10” within index field 334 ¹ ₁₃, weight w⁵ ₁₃ has an associated index i⁵ of “11” within index field 334 ¹ ₁₃, weight w⁶ ₁₃ has an associated index i⁶ of “10” within index field 334 ¹ ₁₃, and weight w⁸ ₁₃ has an associated index i⁸ of “10” within index field 334 ¹ ₁₃. Weights w¹ ₁₃, w² ₁₃, w³ ₁₃ and w⁷ ₁₃ have associated indices i¹, i², i³ and i⁷ (respectively) of “00” within index field 334 ¹ ₁₃. The hexadecimal value for index field 334 ¹ ₁₃ is 0x8B80.

Data field 332 ¹ ₁₄ includes weights w¹ ₁₄, w⁵ ₁₄, w⁶ ₁₄ and w⁷ ₁₄ with non-zero values (weights w² ₁₄, w³ ₁₄, w⁴ ₁₄ and w⁸ ₁₄ have zero values). Weight w¹ ₁₄ has an associated index i¹ of “10” within index field 334 ¹ ₁₄, weight w⁵ ₁₄ has an associated index i⁵ of “10” within index field 334 ¹ ₁₄, weight w⁶ ₁₄ has an associated index i⁶ of “10” within index field 334 ¹ ₁₄, and weight w⁷ ₁₄ has an associated index i⁷ of “10” within index field 334 ¹ ₁₄. Weights w² ₁₄, w³ ₁₄, w⁴ ₁₄ and w⁸ ₁₄ have associated indices i², i³, i⁴ and i⁸ (respectively) of “00” within index field 334 ¹ ₁₄. The hexadecimal value for index field 334 ¹ ₁₄ is 0x2A02.

Data field 332 ¹ ₁₅ includes weights w² ₁₅, w⁶ ₁₅, w⁷ ₁₅ and w⁸ ₁₅ with non-zero values (weights w¹ ₁₅, w³ ₁₅, w⁴ ₁₅ and w⁵ ₁₅ have zero values). Weight w² ₁₅ has an associated index i² of “10” within index field 334 ¹ ₁₅, weight w⁶ ₁₅ has an associated index i⁶ of “10” within index field 334 ¹ ₁₅, weight w⁷ ₁₅ has an associated index i⁷ of “10” within index field 334 ¹ ₁₅, and weight w⁸ ₁₅ has an associated index i⁸ of “10” within index field 334 ¹ ₁₅. Weights w¹ ₁₅, w³ ₁₅, w⁴ ₁₅ and w⁵ ₁₅ have associated indices i¹, i³, i⁴ and i⁵ (respectively) of “00” within index field 334 ¹ ₁₅. The hexadecimal value for index field 334 ¹ ₁₅ is 0xA808.

Data field 332 ¹ ₁₆ includes weights w¹ ₁₆, w³ ₁₆, w⁷ ₁₆ and w⁸ ₁₆ with non-zero values (weights w² ₁₆, w⁴ ₁₆, w⁵ ₁₆ and w⁶ ₁₆ have zero values). Weight w¹ ₁₆ has an associated index i¹ of “10” within index field 334 ¹ ₁₆, weight w³ ₁₆ has an associated index i³ of “11” within index field 334 ¹ ₁₆, weight w⁷ ₁₆ has an associated index i⁷ of “10” within index field 334 ¹ ₁₆, and weight w⁸ ₁₆ has an associated index i⁸ of “10” within index field 334 ¹ ₁₆. Weights w² ₁₆, w⁴ ₁₆, w⁵ ₁₆ and w⁶ ₁₆ have associated indices i², i⁴, i⁵ and i⁶ (respectively) of “00” within index field 334 ¹ ₁₆. The hexadecimal value for index field 334 ¹ ₁₆ is 0xA032.
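The hexadecimal values quoted above can be checked with a short script. The following is a minimal sketch that assumes each index occupies two bits of a 16-bit index field, with index i¹ in the least significant bit pair and index i⁸ in the most significant bit pair; this layout is inferred from the quoted values rather than stated explicitly, and `pack_index_field` is a hypothetical helper name.

```python
def pack_index_field(indices):
    """Pack eight 2-bit indices (i1 first) into one 16-bit integer.

    Assumption: i1 occupies bits 1:0 and i8 occupies bits 15:14.
    """
    if len(indices) != 8:
        raise ValueError("expected eight 2-bit indices")
    field = 0
    for position, index in enumerate(indices):
        if not 0 <= index <= 0b11:
            raise ValueError("each index must fit in two bits")
        field |= index << (2 * position)
    return field


# Index field 334-1-10: w1, w2, w3 and w5 carry "10", the rest carry "00".
field_10 = pack_index_field([0b10, 0b10, 0b10, 0b00, 0b10, 0b00, 0b00, 0b00])
# Index field 334-1-13: w4, w6 and w8 carry "10", and w5 carries "11".
field_13 = pack_index_field([0b00, 0b00, 0b00, 0b10, 0b11, 0b10, 0b00, 0b10])

print(f"0x{field_10:04X}")  # 0x022A
print(f"0x{field_13:04X}")  # 0x8B80
```

Under this assumed bit order, the script reproduces the values given for index fields 334 ¹ ₁₀ and 334 ¹ ₁₃.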

FIG. 8D depicts encoded block set 340 ¹, which includes encoded blocks 340 ¹ ₁, 340 ¹ ₂, 340 ¹ ₃, 340 ¹ ₄, 340 ¹ ₅, 340 ¹ ₆, 340 ¹ ₇, 340 ¹ ₈, 340 ¹ ₉, 340 ¹ ₁₀, 340 ¹ ₁₁, 340 ¹ ₁₂, 340 ¹ ₁₃, 340 ¹ ₁₄, 340 ¹ ₁₅ and 340 ¹ ₁₆. Each encoded block 340 ¹ _(j) includes data field d_(j) and index field i_(j).

More particularly, encoded block 340 ¹ ₁ includes data field d₁ (i.e., data field 332 ¹ ₁) and index field i₁ (i.e., index field 334 ¹ ₁), encoded block 340 ¹ ₂ includes data field d₂ (i.e., data field 332 ¹ ₂) and index field i₂ (i.e., index field 334 ¹ ₂), encoded block 340 ¹ ₃ includes data field d₃ (i.e., data field 332 ¹ ₃) and index field i₃ (i.e., index field 334 ¹ ₃), encoded block 340 ¹ ₄ includes data field d₄ (i.e., data field 332 ¹ ₄) and index field i₄ (i.e., index field 334 ¹ ₄), encoded block 340 ¹ ₅ includes data field d₅ (i.e., data field 332 ¹ ₅) and index field i₅ (i.e., index field 334 ¹ ₅), encoded block 340 ¹ ₆ includes data field d₆ (i.e., data field 332 ¹ ₆) and index field i₆ (i.e., index field 334 ¹ ₆), encoded block 340 ¹ ₇ includes data field d₇ (i.e., data field 332 ¹ ₇) and index field i₇ (i.e., index field 334 ¹ ₇), encoded block 340 ¹ ₈ includes data field d₈ (i.e., data field 332 ¹ ₈) and index field i₈ (i.e., index field 334 ¹ ₈), encoded block 340 ¹ ₉ includes data field d₉ (i.e., data field 332 ¹ ₉) and index field i₉ (i.e., index field 334 ¹ ₉), encoded block 340 ¹ ₁₀ includes data field d₁₀ (i.e., data field 332 ¹ ₁₀) and index field i₁₀ (i.e., index field 334 ¹ ₁₀), encoded block 340 ¹ ₁₁ includes data field d₁₁ (i.e., data field 332 ¹ ₁₁) and index field i₁₁ (i.e., index field 334 ¹ ₁₁), encoded block 340 ¹ ₁₂ includes data field d₁₂ (i.e., data field 332 ¹ ₁₂) and index field i₁₂ (i.e., index field 334 ¹ ₁₂), encoded block 340 ¹ ₁₃ includes data field d₁₃ (i.e., data field 332 ¹ ₁₃) and index field i₁₃ (i.e., index field 334 ¹ ₁₃), encoded block 340 ¹ ₁₄ includes data field d₁₄ (i.e., data field 332 ¹ ₁₄) and index field i₁₄ (i.e., index field 334 ¹ ₁₄), encoded block 340 ¹ ₁₅ includes data field d₁₅ (i.e., data field 332 ¹ ₁₅) and index field i₁₅ (i.e., index field 334 ¹ ₁₅), and encoded block 340 ¹ ₁₆ includes data field d₁₆ (i.e., data field 332 ¹ ₁₆) and index field i₁₆ (i.e., index field 334 ¹ ₁₆).

Basic block matrix sets 312 ² and 312 ³ are processed in the same manner.

FIG. 9 depicts a data flow diagram 380 for a portion of a training process for CNN 15, according to an embodiment of the present disclosure.

In order to encode weight tensors 202 ¹, 202 ² and 202 ³ of filter 202 for inference, CNN 15 is first trained using basic block creation process 355, basic block encoding process 360 and encoded block conversion process 365 in the forward path of the convolutional layer calculations. For example, data flow diagram 380 depicts a portion of the training process for CNN 15 that includes converted convolutional layer calculation 210.

During training, basic block creation process 355, basic block encoding process 360 and encoded block conversion process 365 may be implemented by the processor that is hosting the training process for CNN 15, such as, for example, a central processing unit (CPU), graphics processing unit (GPU), neural processing unit (NPU), etc.

During this portion of the forward phase, input data tensor 204 is provided to converted convolutional layer calculation 210. Generally, filter 202, including weight tensors 202 ^(i), is provided to basic block creation process 355, which decomposes each weight tensor 202 ^(i) into a basic block set 302 ^(i) and then reforms each basic block set 302 ^(i) into a basic block matrix set 312 ^(i), as described above. Each basic block matrix set 312 ^(i) is then provided to basic block encoding process 360, which encodes each basic block matrix set 312 ^(i) into an encoded block set 340 ^(i) that includes encoded blocks 340 ^(i) _(j), as described above. Each encoded block set 340 ^(i) is then provided to encoded block conversion process 365 which converts the encoded block set 340 ^(i) into a reconstructed weight tensor 202 ^(i) by simply performing the steps described above in reverse order.
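The reverse step for a single encoded block may be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: index “00” is taken to mean that the weight is zero and has no entry in the data field, any other index consumes the next data field entry in order, the index field is packed with i¹ in the least significant bit pair, and the data field entries are treated as already-expanded integers (the bit-width expansion of each entry is not modeled). The function names are hypothetical.

```python
def unpack_index_field(field, count=8):
    """Split a packed index field into `count` 2-bit indices, i1 first."""
    return [(field >> (2 * k)) & 0b11 for k in range(count)]


def reconstruct_basic_block(data_field, index_field, count=8):
    """Rebuild one basic block matrix (a column of `count` weights).

    Assumption: index 0b00 means the weight is zero and has no entry in
    the data field; any other index takes the next data field entry.
    """
    stored = iter(data_field)
    weights = []
    for index in unpack_index_field(index_field, count):
        weights.append(0 if index == 0b00 else next(stored))
    return weights


# Index field 0x022A marks w1, w2, w3 and w5 as stored; the values are made up.
print(reconstruct_basic_block([7, -3, 5, 2], 0x022A))
# [7, -3, 5, 0, 2, 0, 0, 0]
```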

The reconstructed weight tensors 202 ^(i) are then provided to converted convolutional layer calculation 210, which convolves the reconstructed weight tensors 202 ^(i) and the input data tensor 204 to generate output data tensor 206, which is provided to the next layer of CNN 15. Each reconstructed weight tensor 202 ^(i) may include weights with different values than the weights of the respective weight tensor 202 ^(i) due to potential losses introduced by basic block encoding process 360.

During the backward phase, the gradients are backpropagated and weight tensors 202 ^(i) within filter 202 are updated. FIG. 9 depicts the embodiment described in detail above, in which “i” is equal to 3.

During inference, input data tensor 204 and each encoded block set 340 ^(i) are read from memory and provided to an MMA, which performs encoded block conversion process 365 to convert each encoded block set 340 ^(i) into a reconstructed weight tensor 202 ^(i), and then executes the converted convolutional layer calculation 210. For certain layers of CNN 15, the output data tensor 206 may be further processed by the MMA and then provided as the input data tensor 204 to the next convolution layer. Each encoded block set 340 ^(i) for the next convolution layer is read from memory, the MMA performs encoded block conversion process 365 to convert each encoded block set 340 ^(i) into a reconstructed weight tensor 202 ^(i), and then executes the converted convolutional layer calculation 210 for this layer.

In many embodiments, the weight tensors are sufficiently sparse, or many elements have sufficiently small magnitude values, such that transforming each basic block set into an encoded block set is essentially lossless. The encoding described above is very well suited to ANN weights, because ANN weights are often very sparse but include an occasional large-magnitude weight that is important to preserve.

In other embodiments, transforming each basic block set to an encoded block set is lossy, and, to minimize the loss, a simple search or optimization technique may be employed during training to optimally assign the type to each element, such as, for example, a neural architecture search (NAS), a differentiable NAS (DNAS), Bayesian optimization, etc. In these embodiments, a small extra “fine tuning” training phase after a lossy encoding (typically quantization) may be used to recover any loss in accuracy due to the error introduced.

FIG. 10 depicts a block diagram of system 100, in accordance with an embodiment of the present disclosure.

Computer 102 includes bus 110 coupled to one or more processors 120, memory 130, I/O interfaces 140, display interface 150, one or more communication interfaces 160 and one or more MMAs 400. Generally, I/O interfaces 140 are coupled to I/O devices 142 using a wired or wireless connection, display interface 150 is coupled to display 152, and communication interface 160 is connected to network 162 using a wired or wireless connection.

Bus 110 is a communication system that transfers data between processor 120, memory 130, I/O interfaces 140, display interface 150, communication interface 160, MMA 400, as well as other components not depicted in FIG. 10. Power connector 112 is coupled to bus 110 and a power supply (not shown).

Processor 120 includes one or more general-purpose or application-specific microprocessors that execute instructions to perform control, computation, input/output, etc. functions for computer 102. Processor 120 may include a single integrated circuit, such as a micro-processing device, or multiple integrated circuit devices and/or circuit boards working in cooperation to accomplish the functions of processor 120. In addition, processor 120 may execute computer programs or modules, such as operating system 132, software modules 134, etc., stored within memory 130. For example, software modules 134 may include an ML application, an ANN application, a CNN application, etc.

Generally, storage element or memory 130 stores instructions for execution by processor 120 and data. Memory 130 may include a variety of non-transitory computer-readable media that may be accessed by processor 120. In various embodiments, memory 130 may include volatile and nonvolatile media, non-removable media and/or removable media. For example, memory 130 may include any combination of random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), read only memory (ROM), flash memory, cache memory, and/or any other type of non-transitory computer-readable medium.

Memory 130 contains various components for retrieving, presenting, modifying, and storing data. For example, memory 130 stores software modules that provide functionality when executed by processor 120. The software modules include operating system 132 that provides operating system functionality for computer 102. Software modules 134 provide various functionality, such as image classification using convolutional neural networks, etc. Data 136 may include data associated with operating system 132, software modules 134, etc.

I/O interfaces 140 are configured to transmit and/or receive data from I/O devices 142. I/O interfaces 140 enable connectivity between processor 120 and I/O devices 142 by encoding data to be sent from processor 120 to I/O devices 142, and decoding data received from I/O devices 142 for processor 120. Generally, data may be sent over wired and/or wireless connections. For example, I/O interfaces 140 may include one or more wired communications interfaces, such as USB, Ethernet, etc., and/or one or more wireless communications interfaces, coupled to one or more antennas, such as WiFi, Bluetooth, cellular, etc.

Generally, I/O devices 142 provide input to computer 102 and/or output from computer 102. As discussed above, I/O devices 142 are operably connected to computer 102 using a wired and/or wireless connection. I/O devices 142 may include a local processor coupled to a communication interface that is configured to communicate with computer 102 using the wired and/or wireless connection. For example, I/O devices 142 may include a keyboard, mouse, touch pad, joystick, etc.

Display interface 150 is configured to transmit image data from computer 102 to monitor or display 152.

Communication interface 160 is configured to transmit data to and from network 162 using one or more wired and/or wireless connections. Network 162 may include one or more local area networks, wide area networks, the Internet, etc., which may execute various network protocols, such as, for example, wired and/or wireless Ethernet, Bluetooth, etc. Network 162 may also include various combinations of wired and/or wireless physical layers, such as, for example, copper wire or coaxial cable networks, fiber optic networks, Bluetooth wireless networks, WiFi wireless networks, CDMA, FDMA and TDMA cellular wireless networks, etc.

MMA 400 is configured to multiply matrices and generate output matrices to support various applications implemented by software modules 134.

FIG. 11 depicts a block diagram of MMA 400, in accordance with embodiments of the present disclosure.

MMA 400 includes I/O interface 405, controller 410, memory 415, register 420, register 430, register 440 and PE array 450 including a number of PEs 452. Controller 410 is coupled to I/O interface 405, memory 415, registers 420, 430, 440 and PEs 452. Memory 415 is coupled to I/O interface 405 and registers 420, 430, 440. Register 420 is coupled to the first column of PEs 452 of PE array 450, register 430 is coupled to the first row of PEs 452 of PE array 450, and register 440 is coupled to each PE 452 of PE array 450.

I/O interface 405 is coupled to bus 110 and memory 415. I/O interface 405 includes a microcontroller that sends data to, and receives data and commands from, processor 120, memory 130, etc.

Memory 415 may include volatile and/or nonvolatile memory. For example, memory 415 may include any combination of random access memory (RAM), dynamic RAM (DRAM), static RAM (SRAM), read only memory (ROM), flash memory, cache memory, and/or any other type of non-transitory computer-readable medium.

Generally, input data tensors and encoded block sets are received from memory 130, over bus 110 via I/O interface 405, and stored in memory 415, while output data tensors are stored in memory 415 and then transmitted, over bus 110 via I/O interface 405, to memory 130.

Controller 410 may be a processor, microprocessor, microcontroller, field programmable gate array (FPGA), etc., that controls the data flow and operation of MMA 400. For example, in response to commands received from one or more software modules 134 executing on processor 120, controller 410 performs load/store (L/S) instructions, memory mapped I/O (MMIO) operations, direct memory access (DMA) operations, etc., to convert each encoded block set into a reconstructed weight tensor, and then convolve each reconstructed weight tensor and an input data tensor to generate an output data matrix, in cooperation with memory 415, registers 420, 430, 440 and PE array 450.

More particularly, controller 410 converts each encoded block set into a reconstructed weight tensor by generating the basic block matrix set based on the encoded block set, and then generating the reconstructed weight tensor based on the basic block matrix set. Controller 410 then convolves each reconstructed weight tensor and the input data tensor to generate the output data matrix by converting the reconstructed weight tensor to a converted weight matrix based on a convolution operation, converting the input data tensor to a converted input data matrix based on the convolution operation, and then multiplying the converted weight matrix and the converted input data matrix to generate the output data matrix.
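The conversion of a convolution into a matrix multiplication can be illustrated with a small example. The sketch below assumes a stride of 1, no padding, a `[channels][height][width]` tensor layout and a particular flattening order; none of these details are fixed by the description above, and the function names (`im2col`, `flatten_filter`, `matmul`) are chosen for illustration only.

```python
def im2col(input_tensor, kernel_h, kernel_w):
    """Convert a [channels][height][width] tensor to a converted input matrix.

    Each column holds the input values covered by one filter position
    (stride 1, no padding; both are assumptions of this sketch).
    """
    channels = len(input_tensor)
    height, width = len(input_tensor[0]), len(input_tensor[0][0])
    patches = []
    for top in range(height - kernel_h + 1):
        for left in range(width - kernel_w + 1):
            patches.append([
                input_tensor[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(kernel_h)
                for dx in range(kernel_w)
            ])
    # Transpose so that each patch becomes a column.
    return [list(row) for row in zip(*patches)]


def flatten_filter(weight_tensors):
    """Flatten each [channels][kh][kw] weight tensor into one matrix row."""
    return [[w for channel in tensor for row in channel for w in row]
            for tensor in weight_tensors]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# One 3x3 input channel and one 2x2 weight tensor (values are illustrative).
input_tensor = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
filter_weights = [[[[1, 0], [0, 1]]]]
output = matmul(flatten_filter(filter_weights), im2col(input_tensor, 2, 2))
print(output)  # [[6, 8, 12, 14]]
```

Each row of the result corresponds to one weight tensor, and each column corresponds to one filter position in the input.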

Register 420 provides activation elements to the PEs 452 in the first column of PE array 450, while register 430 provides weight elements to the first row of PEs 452 of PE array 450. Registers 420 and 430 may be n elements wide and m elements deep, each element being the same size as the data contained within converted output data matrix 216, such as, for example, 8 bit integer data, 16 bit integer data, 32 bit integer data, 16 bit floating point data, 16 bit Bfloat data, 32 bit floating point data, etc. In one embodiment, registers 420 and 430 are 3 elements wide and 128 elements deep, with each element storing 8-bit integer data, and each PE 452 includes an 8-bit MAC unit to comport with the embodiments discussed above.

PE array 450 includes 9 PEs 452 arranged in a 3×3 array (i.e., PE₁, PE₂, PE₃, PE₄, PE₅, PE₆, PE₇, PE₈ and PE₉); other numbers of PEs 452 and arrangements are also contemplated, such as, for example, four PEs 452 arranged in a 2×2 array, 25 PEs 452 arranged in a 5×5 array, 36 PEs 452 arranged in a 6×6 array, 49 PEs 452 arranged in a 7×7 array, 64 PEs 452 arranged in an 8×8 array, etc. Non-symmetric arrangements, such as a 2×3 array, a 3×4 array, a 4×5 array, a 4×6 array, etc., may be advantageous for certain applications. Each PE 452 calculates a dot product for one element of converted output data matrix 216.

For example, PE₁ 452 is located in the first row and the first column of PE array 450, and calculates the dot products of the 1^(st) row of converted weight matrix 212 (i.e., W₁) and the 1^(st), 4^(th) and 7^(th) columns of converted input data matrix 214 (i.e., A₁) to generate the o¹ ₁, o¹ ₄, and o¹ ₇ elements of converted output data matrix 216 ¹, as discussed above with respect to MAC unit m₁.

PE₂ 452 is located in the first row and the second column of PE array 450, and calculates the dot products of the 2^(nd) row of converted weight matrix 212 (i.e., W₂) and the 1^(st), 4^(th) and 7^(th) columns of converted input data matrix 214 (i.e., A₁) to generate the o² ₁, o² ₄, and o² ₇ elements of converted output data matrix 216 ², as discussed above with respect to MAC unit m₂.

PE₃ 452 is located in the first row and the third column of PE array 450, and calculates the dot products of the 3^(rd) row of converted weight matrix 212 (i.e., W₃) and the 1^(st), 4^(th) and 7^(th) columns of converted input data matrix 214 (i.e., A₁) to generate the o³ ₁, o³ ₄, and o³ ₇ elements of converted output data matrix 216 ³, as discussed above with respect to MAC unit m₃.

PE₄ 452 is located in the second row and the first column of PE array 450, and calculates the dot products of the 1^(st) row of converted weight matrix 212 (i.e., W₁) and the 2^(nd), 5^(th) and 8^(th) columns of converted input data matrix 214 (i.e., A₂) to generate the o¹ ₂, o¹ ₅, and o¹ ₈ elements of converted output data matrix 216 ¹, as discussed above with respect to MAC unit m₄.

PE₅ 452 is located in the second row and the second column of PE array 450, and calculates the dot products of the 2^(nd) row of converted weight matrix 212 (i.e., W₂) and the 2^(nd), 5^(th) and 8^(th) columns of converted input data matrix 214 (i.e., A₂) to generate the o² ₂, o² ₅, and o² ₈ elements of converted output data matrix 216 ², as discussed above with respect to MAC unit m₅.

PE₆ 452 is located in the second row and the third column of PE array 450, and calculates the dot products of the 3^(rd) row of converted weight matrix 212 (i.e., W₃) and the 2^(nd), 5^(th) and 8^(th) columns of converted input data matrix 214 (i.e., A₂) to generate the o³ ₂, o³ ₅, and o³ ₈ elements of converted output data matrix 216 ³, as discussed above with respect to MAC unit m₆.

PE₇ 452 is located in the third row and the first column of PE array 450, and calculates the dot products of the 1^(st) row of converted weight matrix 212 (i.e., W₁) and the 3^(rd), 6^(th) and 9^(th) columns of converted input data matrix 214 (i.e., A₃) to generate the o¹ ₃, o¹ ₆, and o¹ ₉ elements of converted output data matrix 216 ¹, as discussed above with respect to MAC unit m₇.

PE₈ 452 is located in the third row and the second column of PE array 450, and calculates the dot products of the 2^(nd) row of converted weight matrix 212 (i.e., W₂) and the 3^(rd), 6^(th) and 9^(th) columns of converted input data matrix 214 (i.e., A₃) to generate the o² ₃, o² ₆, and o² ₉ elements of converted output data matrix 216 ², as discussed above with respect to MAC unit m₈.

PE₉ 452 is located in the third row and the third column of PE array 450, and calculates the dot products of the 3^(rd) row of converted weight matrix 212 (i.e., W₃) and the 3^(rd), 6^(th) and 9^(th) columns of converted input data matrix 214 (i.e., A₃) to generate the o³ ₃, o³ ₆, and o³ ₉ elements of converted output data matrix 216 ³, as discussed above with respect to MAC unit m₉.
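The assignment of work to the nine PEs follows a regular pattern: the column of a PE selects the row of converted weight matrix 212, and the row of a PE selects every third column of converted input data matrix 214. A minimal sketch of this mapping is given below; `pe_assignment` is a hypothetical helper, and the nine-column input matrix is taken from the example above.

```python
def pe_assignment(pe_number, array_size=3, num_columns=9):
    """Return (weight_row, input_columns) for PE number 1..array_size**2.

    PEs are numbered row by row. The PE's column selects the row of the
    converted weight matrix; the PE's row selects every array_size-th
    column of the converted input data matrix.
    """
    pe_row, pe_col = divmod(pe_number - 1, array_size)
    weight_row = pe_col + 1
    input_columns = list(range(pe_row + 1, num_columns + 1, array_size))
    return weight_row, input_columns


for pe in range(1, 10):
    weight_row, cols = pe_assignment(pe)
    outputs = ", ".join(f"o{weight_row}_{c}" for c in cols)
    print(f"PE{pe}: W{weight_row} x columns {cols} -> {outputs}")
```

For example, the final line printed is `PE9: W3 x columns [3, 6, 9] -> o3_3, o3_6, o3_9`.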

Generally, basic block encoding process 360 results in a fixed encoded block size and a fixed number of MAC operations per block, especially when the datapath is implemented with some flexibility. Advantageously, a datapath based on 8-bit MAC units can be split into two 4-bit MAC processing paths per 8-bit MAC unit to leverage a 4-bit encoding scheme.
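One way to picture the split of an 8-bit MAC unit into two 4-bit paths is sketched below. The packing scheme is an assumption made for illustration (two signed 4-bit weights per byte, low nibble first, each path with its own activation and accumulator); the description does not specify how the split is performed, and all function names are hypothetical.

```python
def to_signed_nibble(value):
    """Interpret a 4-bit pattern (0..15) as a two's-complement integer."""
    return value - 16 if value & 0x8 else value


def split_byte(packed):
    """Return the (low, high) signed 4-bit weights held in one byte."""
    return to_signed_nibble(packed & 0xF), to_signed_nibble((packed >> 4) & 0xF)


def dual_nibble_mac(acc_low, acc_high, packed_weights, act_low, act_high):
    """Perform two 4-bit multiply-accumulates from one packed weight byte."""
    w_low, w_high = split_byte(packed_weights)
    return acc_low + w_low * act_low, acc_high + w_high * act_high


# 0x3E packs the weights -2 (low nibble 0xE) and 3 (high nibble 0x3).
print(split_byte(0x3E))                   # (-2, 3)
print(dual_nibble_mac(0, 0, 0x3E, 5, 4))  # (-10, 12)
```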

The embodiments described herein are combinable.

In one embodiment, a system includes a memory, a processor coupled to the memory and a matrix multiply accelerator (MMA) coupled to the processor and the memory. The memory is configured to store one or more weight tensors, each weight tensor including a number of weights. The processor is configured, for each weight tensor, to generate, based on the weight tensor, a basic block matrix set including a number of basic block matrices, each basic block matrix including a number of weights; to generate, based on the basic block matrix set, an encoded block set, the encoded block set including a number of encoded blocks, each encoded block including a data field and an index field, the data field including a number of encoded weights, the index field including an index associated with each weight in the basic block matrix, the number of encoded weights being less than the number of weights in the basic block matrix, each encoded block having a same size; and to store the encoded block set in the memory. The MMA is configured to convert each encoded block set into a reconstructed weight tensor having a number of weights equal to the number of weights of the respective weight tensor, and convolve each reconstructed weight tensor and an input data tensor to generate an output data matrix.

In another embodiment of the system, each weight tensor has a height, a width and a depth equal to a number of input channels; each basic block matrix has a width of 1 and a height equal to the number of input channels; and each basic block matrix includes one weight from each input channel.

In another embodiment of the system, the index indicates an encoding type of the associated weight, the encoding type including a zero magnitude type, a small magnitude type and a large magnitude type.

In another embodiment of the system, each weight tensor includes n-bit integer elements; the zero magnitude type is an n/2-bit integer element; the small magnitude type is an n/2-bit integer element; and the large magnitude type is an n/2-bit integer element.

In another embodiment of the system, the encoding type includes a full magnitude weight type that is an n-bit integer element.

In another embodiment of the system, generate the encoded block set includes, for each weight in the basic block matrix, determine an encoding type for the weight; generate an index for the weight based on the encoding type; generate an encoded weight based on the encoding type and the weight; add the index to the index field of the encoded block; and, when the encoding type is not a zero magnitude weight type, add the encoded weight to the data field of the encoded block.

In another embodiment of the system, determine an encoding type for the weight is based on a lower threshold value and an upper threshold value.

In another embodiment of the system, determine an encoding type for the weight includes select the zero magnitude weight type when the weight has a zero value or the weight has a non-zero value that is less than or equal to the lower threshold value; select the small magnitude weight type when the weight has a non-zero value that is greater than the lower threshold value and less than or equal to the upper threshold value; and select the large magnitude weight type when the weight has a non-zero value that is greater than the upper threshold value.
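The per-weight encoding steps and the threshold-based type selection may be sketched together as follows. Several details are assumptions made for illustration: the thresholds are compared against the absolute value of the weight, the 2-bit index codes in `INDEX_CODES` are arbitrary, the index for the first weight is placed in the least significant bit pair, and the encoded weight is stored unchanged (its reduction to n/2 bits and the fixed encoded block size are not modeled). The names are hypothetical.

```python
INDEX_CODES = {"zero": 0b00, "small": 0b01, "large": 0b10}  # illustrative codes


def classify(weight, lower, upper):
    """Select an encoding type from the weight magnitude and two thresholds."""
    magnitude = abs(weight)  # assumption: thresholds apply to the magnitude
    if magnitude <= lower:
        return "zero"        # also covers weights that are exactly zero
    return "small" if magnitude <= upper else "large"


def encode_basic_block(weights, lower, upper):
    """Encode one basic block matrix into (data_field, index_field)."""
    data_field = []
    index_field = 0
    for position, weight in enumerate(weights):
        encoding_type = classify(weight, lower, upper)  # determine the type
        index = INDEX_CODES[encoding_type]              # generate the index
        index_field |= index << (2 * position)          # add it to the index field
        if encoding_type != "zero":
            # The encoded weight is stored unchanged in this sketch.
            data_field.append(weight)
    return data_field, index_field


data, index = encode_basic_block([0, 3, -90, 1, 0, 0, 12, 0], lower=1, upper=15)
print(data, f"0x{index:04X}")  # [3, -90, 12] 0x1024
```

In this example the weight with value 1 falls at the lower threshold and is therefore dropped as a zero magnitude weight, which is the source of loss in a lossy encoding.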

In another embodiment of the system, each weight in the reconstructed weight tensor has a corresponding weight in the respective weight tensor that has a same value; convert each encoded block set into a reconstructed weight tensor includes generate, based on the encoded block set, the basic block matrix set, and generate, based on the basic block matrix set, the reconstructed weight tensor; convolve each reconstructed weight tensor and an input data tensor includes convert the reconstructed weight tensor, based on a convolution operation, to a converted weight matrix, convert the input data tensor, based on the convolution operation, to a converted input data matrix, and multiply the converted weight matrix and the converted input data matrix to generate the output data matrix.

In another embodiment of the system, the MMA includes a memory; a controller configured to convert the reconstructed weight tensor to the converted weight matrix, and convert the input data tensor to the converted input data matrix; a first register configured to store at least a portion of the converted input data matrix; a second register configured to store at least a portion of the converted weight matrix; a third register configured to store at least a portion of the output data matrix; and an array of processing elements (PEs), coupled to the controller and the first, second and third registers, configured to multiply the converted weight matrix and the converted input data matrix, each PE including a multiply-and-accumulate (MAC) circuit configured to generate a dot product between one row of the converted weight matrix and one column of the converted input data matrix.

In one embodiment, a computer-based method includes, at a processor coupled to a memory storing one or more weight tensors, for each weight tensor, generating, based on the weight tensor, a basic block matrix set including a number of basic block matrices, each basic block matrix including a number of weights; generating, based on the basic block matrix set, an encoded block set, the encoded block set including a number of encoded blocks, each encoded block including a data field and an index field, the data field including a number of encoded weights, the index field including an index associated with each weight in the basic block matrix, the number of encoded weights being less than the number of weights in the basic block matrix, each encoded block having a same size; and storing the encoded block set in the memory. The method also includes, at a matrix multiply accelerator (MMA) coupled to the processor and the memory, converting each encoded block set into a reconstructed weight tensor having a number of weights equal to the number of weights of the respective weight tensor; and convolving each reconstructed weight tensor and an input data tensor to generate an output data matrix.

In another embodiment of the method, each weight tensor has a height, a width and a depth equal to a number of input channels; each basic block matrix has a width of 1 and a height equal to the number of input channels; and each basic block matrix includes one weight from each input channel.

In another embodiment of the method, the index indicates an encoding type of the associated weight, the encoding type including a zero magnitude type, a small magnitude type and a large magnitude type.

In another embodiment of the method, each weight tensor includes n-bit integer elements; the zero magnitude type is an n/2-bit integer element; the small magnitude type is an n/2-bit integer element; and the large magnitude type is an n/2-bit integer element.

In another embodiment of the method, the encoding type includes a full magnitude weight type that is an n-bit integer element.

In another embodiment of the method, generating the encoded block set includes, for each weight in the basic block matrix, determining an encoding type for the weight; generating an index for the weight based on the encoding type; generating an encoded weight based on the encoding type and the weight; adding the index to the index field of the encoded block; and when the encoding type is not a zero magnitude weight type, adding the encoded weight to the data field of the encoded block.

In another embodiment of the method, determining an encoding type for the weight is based on a lower threshold value and an upper threshold value.

In another embodiment of the method, determining an encoding type for the weight includes selecting the zero magnitude weight type when the weight has a zero value or the weight has a non-zero value that is less than or equal to the lower threshold value; selecting the small magnitude weight type when the weight has a non-zero value that is greater than the lower threshold value and less than or equal to the upper threshold value; and selecting the large magnitude weight type when the weight has a non-zero value that is greater than the upper threshold value.

In another embodiment of the method, each weight in the reconstructed weight tensor has a corresponding weight in the respective weight tensor that has a same value.

In another embodiment of the method, converting each encoded block set into a reconstructed weight tensor includes generating, based on the encoded block set, the basic block matrix set, and generating, based on the basic block matrix set, the reconstructed weight tensor; convolving each reconstructed weight tensor and the input data tensor includes converting the reconstructed weight tensor, based on a convolution operation, to a converted weight matrix, converting the input data tensor, based on the convolution operation, to a converted input data matrix, and multiplying the converted weight matrix and the converted input data matrix to generate the output data matrix.

While implementations of the disclosure are susceptible to embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the disclosure and not intended to limit the disclosure to the specific embodiments shown and described. In the description above, like reference numerals may be used to describe the same, similar or corresponding parts in the several views of the drawings.

In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Reference throughout this document to “one embodiment,” “certain embodiments,” “an embodiment,” “implementation(s),” “aspect(s),” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.

The term “or” as used herein is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts is in some way inherently mutually exclusive. Also, grammatical conjunctions are intended to express any and all disjunctive and conjunctive combinations of conjoined clauses, sentences, words, and the like, unless otherwise stated or clear from the context. Thus, the term “or” should generally be understood to mean “and/or” and so forth. References to items in the singular should be understood to include items in the plural, and vice versa, unless explicitly stated otherwise or clear from the text.

Recitation of ranges of values herein is not intended to be limiting, referring instead individually to any and all values falling within the range, unless otherwise indicated, and each separate value within such a range is incorporated into the specification as if it were individually recited herein. The words “about,” “approximately,” or the like, when accompanying a numerical value, are to be construed as indicating a deviation as would be appreciated by one of ordinary skill in the art to operate satisfactorily for an intended purpose. Ranges of values and/or numeric values are provided herein as examples only, and do not constitute a limitation on the scope of the described embodiments. The use of any and all examples, or exemplary language (“e.g.,” “such as,” “for example,” or the like) provided herein, is intended merely to better illuminate the embodiments and does not pose a limitation on the scope of the embodiments. No language in the specification should be construed as indicating any unclaimed element as essential to the practice of the embodiments.

For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The description is not to be considered as limited to the scope of the embodiments described herein.

In the following description, it is understood that terms such as “first,” “second,” “top,” “bottom,” “up,” “down,” “above,” “below,” and the like, are words of convenience and are not to be construed as limiting terms. Also, the terms apparatus, device, system, etc. may be used interchangeably in this text.

The many features and advantages of the disclosure are apparent from the detailed specification, and, thus, it is intended by the appended claims to cover all such features and advantages of the disclosure which fall within the scope of the disclosure. Further, since numerous modifications and variations will readily occur to those skilled in the art, it is not desired to limit the disclosure to the exact construction and operation illustrated and described, and, accordingly, all suitable modifications and equivalents may be resorted to that fall within the scope of the disclosure. 

What is claimed is:
 1. A system, comprising: a memory configured to store one or more weight tensors, each weight tensor including a number of weights; a processor, coupled to the memory, configured to: for each weight tensor: generate, based on the weight tensor, a basic block matrix set including a number of basic block matrices, each basic block matrix including a number of weights, generate, based on the basic block matrix set, an encoded block set, the encoded block set including a number of encoded blocks, each encoded block including a data field and an index field, the data field including a number of encoded weights, the index field including an index associated with each weight in the basic block matrix, the number of encoded weights being less than the number of weights in the basic block matrix, each encoded block having a same size, and store the encoded block set in the memory; and a matrix multiply accelerator (MMA), coupled to the processor and the memory, configured to: convert each encoded block set into a reconstructed weight tensor having a number of weights equal to the number of weights of the respective weight tensor, and convolve each reconstructed weight tensor and an input data tensor to generate an output data matrix.
 2. The system according to claim 1, where: each weight tensor has a height, a width and a depth equal to a number of input channels; each basic block matrix has a width of 1 and a height equal to the number of input channels; and each basic block matrix includes one weight from each input channel.
 3. The system according to claim 2, where: the index indicates an encoding type of the associated weight, the encoding type including a zero magnitude type, a small magnitude type and a large magnitude type.
 4. The system according to claim 3, where: each weight tensor includes n-bit integer elements; the zero magnitude type is an n/2-bit integer element; the small magnitude type is an n/2-bit integer element; and the large magnitude type is an n/2-bit integer element.
 5. The system according to claim 4, where: the encoding type includes a full magnitude weight type that is an n-bit integer element.
 6. The system according to claim 3, where said generate the encoded block set includes: for each weight in the basic block matrix: determine an encoding type for the weight; generate an index for the weight based on the encoding type; generate an encoded weight based on the encoding type and the weight; add the index to the index field of the encoded block; and when the encoding type is not a zero magnitude weight type, add the encoded weight to the data field of the encoded block.
 7. The system according to claim 6, where: said determine an encoding type for the weight is based on a lower threshold value and an upper threshold value.
 8. The system according to claim 7, where said determine an encoding type for the weight includes: select the zero magnitude weight type when the weight has a zero value or the weight has a non-zero value that is less than or equal to the lower threshold value; select the small magnitude weight type when the weight has a non-zero value that is greater than the lower threshold value and less than or equal to the upper threshold value; and select the large magnitude weight type when the weight has a non-zero value that is greater than the upper threshold value.
 9. The system according to claim 1, where: each weight in the reconstructed weight tensor has a corresponding weight in the respective weight tensor that has a same value; said convert each encoded block set into a reconstructed weight tensor includes: generate, based on the encoded block set, the basic block matrix set, and generate, based on the basic block matrix set, the reconstructed weight tensor; said convolve each reconstructed weight tensor and an input data tensor includes: convert the reconstructed weight tensor, based on a convolution operation, to a converted weight matrix, convert the input data tensor, based on the convolution operation, to a converted input data matrix, and multiply the converted weight matrix and the converted input data matrix to generate the output data matrix.
 10. The system according to claim 9, where the MMA includes: a memory; a controller configured to: convert the reconstructed weight tensor to the converted weight matrix, and convert the input data tensor to the converted input data matrix; a first register configured to store at least a portion of the converted input data matrix; a second register configured to store at least a portion of the converted weight matrix; a third register configured to store at least a portion of the output data matrix; and an array of processing elements (PEs), coupled to the controller and the first, second and third registers, configured to multiply the converted weight matrix and the converted input data matrix, each PE including a multiply-and-accumulate (MAC) circuit configured to generate a dot product between one row of the converted weight matrix and one column of the converted input data matrix.
 11. A computer-based method, comprising: at a processor coupled to a memory storing one or more weight tensors: for each weight tensor: generating, based on the weight tensor, a basic block matrix set including a number of basic block matrices, each basic block matrix including a number of weights, generating, based on the basic block matrix set, an encoded block set, the encoded block set including a number of encoded blocks, each encoded block including a data field and an index field, the data field including a number of encoded weights, the index field including an index associated with each weight in the basic block matrix, the number of encoded weights being less than the number of weights in the basic block matrix, each encoded block having a same size, and storing the encoded block set in the memory; and at a matrix multiply accelerator (MMA) coupled to the processor and the memory: converting each encoded block set into a reconstructed weight tensor having a number of weights equal to the number of weights of the respective weight tensor, and convolving each reconstructed weight tensor and an input data tensor to generate an output data matrix.
 12. The computer-based method according to claim 11, where: each weight tensor has a height, a width and a depth equal to a number of input channels; each basic block matrix has a width of 1 and a height equal to the number of input channels; and each basic block matrix includes one weight from each input channel.
 13. The computer-based method according to claim 12, where: the index indicates an encoding type of the associated weight, the encoding type including a zero magnitude type, a small magnitude type and a large magnitude type.
 14. The computer-based method according to claim 13, where: each weight tensor includes n-bit integer elements; the zero magnitude type is an n/2-bit integer element; the small magnitude type is an n/2-bit integer element; and the large magnitude type is an n/2-bit integer element.
 15. The computer-based method according to claim 14, where: the encoding type includes a full magnitude weight type that is an n-bit integer element.
 16. The computer-based method according to claim 13, where said generating the encoded block set includes: for each weight in the basic block matrix: determining an encoding type for the weight; generating an index for the weight based on the encoding type; generating an encoded weight based on the encoding type and the weight; adding the index to the index field of the encoded block; and when the encoding type is not a zero magnitude weight type, adding the encoded weight to the data field of the encoded block.
 17. The computer-based method according to claim 16, where: said determining an encoding type for the weight is based on a lower threshold value and an upper threshold value.
 18. The computer-based method according to claim 17, where said determining an encoding type for the weight includes: selecting the zero magnitude weight type when the weight has a zero value or the weight has a non-zero value that is less than or equal to the lower threshold value; selecting the small magnitude weight type when the weight has a non-zero value that is greater than the lower threshold value and less than or equal to the upper threshold value; and selecting the large magnitude weight type when the weight has a non-zero value that is greater than the upper threshold value.
 19. The computer-based method according to claim 18, where: each weight in the reconstructed weight tensor has a corresponding weight in the respective weight tensor that has a same value.
 20. The computer-based method according to claim 18, where: said converting each encoded block set into a reconstructed weight tensor includes: generating, based on the encoded block set, the basic block matrix set, and generating, based on the basic block matrix set, the reconstructed weight tensor; said convolving each reconstructed weight tensor and the input data tensor includes: converting the reconstructed weight tensor, based on a convolution operation, to a converted weight matrix, converting the input data tensor, based on the convolution operation, to a converted input data matrix, and multiplying the converted weight matrix and the converted input data matrix to generate the output data matrix. 
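
For illustration only, the per-weight encoding steps recited in claims 6 to 8 and 16 to 18 may be sketched as follows. This is a minimal, non-limiting sketch and not the disclosed implementation. The names EncodingType, classify, encode_weight and encode_block are hypothetical, as are the numeric index values assigned to each encoding type. The sketch assumes that thresholds are compared against the absolute value of the weight, and it passes each retained weight through unchanged because the bit-level form of an encoded weight is not fixed here. It also does not pad the data field, so it does not by itself give every encoded block a same size.

```python
from enum import IntEnum


class EncodingType(IntEnum):
    # Hypothetical index values; the actual index encoding is not specified here.
    ZERO = 0
    SMALL = 1
    LARGE = 2


def classify(weight, lower, upper):
    """Select an encoding type from a lower and an upper threshold value."""
    magnitude = abs(weight)  # assumption: thresholds apply to the magnitude
    if weight == 0 or magnitude <= lower:
        return EncodingType.ZERO
    if magnitude <= upper:
        return EncodingType.SMALL
    return EncodingType.LARGE


def encode_weight(weight, kind):
    """Placeholder: the n/2-bit encoded form is not modeled in this sketch."""
    return weight


def encode_block(block, lower, upper):
    """Build the data field and index field for one basic block matrix."""
    data_field = []
    index_field = []
    for weight in block:
        kind = classify(weight, lower, upper)
        # Every weight contributes an index to the index field.
        index_field.append(int(kind))
        # Only weights that are not of the zero magnitude type contribute data.
        if kind != EncodingType.ZERO:
            data_field.append(encode_weight(weight, kind))
    return data_field, index_field


if __name__ == "__main__":
    data, index = encode_block([0, 1, -5, 100, 2], lower=1, upper=15)
    print(index)  # [0, 0, 1, 2, 1]
    print(data)   # [-5, 100, 2]
```

In this sketch the index field always has one entry per weight in the basic block matrix, while the data field holds fewer entries whenever at least one weight is classified as the zero magnitude type.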